Abstract

Consider the following multi-phase project management problem. Each project 
is divided into several phases. All projects enter the next phase at the same point 
chosen by the decision maker based on observations up to that point. Within each 
phase, one can pursue the projects in any order. When pursuing the project with 
one unit of resource, the project state changes according to a Markov chain. The 
probability distribution of the Markov chain is known up to an unknown parameter. 
When pursued, the project generates a random reward depending on the phase and 
the state of the project and the unknown parameter. The decision maker faces two 
problems: (a) how to allocate resources to projects within each phase, and (b) when 
to enter the next phase, so that the total expected reward is as large as possible. In
this paper, we formulate the preceding problem as a stochastic scheduling problem
and propose asymptotically optimal strategies, which minimize the shortfall from the
perfect information payoff. Concrete examples are given to illustrate our method.
1 Introduction



We first formulate the multi-phase project management problem as that of optimally
scheduling a number of jobs. Suppose that a single machine is available to process
$U$ jobs. Each job belongs to one job group and there are $I$ job groups altogether.
Within each group, the jobs can be processed in any order. However, there exists a
predetermined order among job groups. That is, after leaving the current job group,
there is no return to it in the future. The state of a job under processing evolves
as a Markov chain and earns rewards as it is processed, not otherwise. The
time-varying reward distributions depend on an unknown parameter $\theta$. The objective is to
minimize the shortfall from perfect information payoff, which is the difference between 
the optimal reward when the parameter is known and that when it is unknown. We 
establish an asymptotic lower bound on this difference and construct policies which 
attain the lower bound. Clearly the preceding stochastic scheduling problem is the 
same as the multi-phase project management problem when we identify jobs in the 
same group with projects in the same phase. 

To solve the proposed stochastic scheduling problem, we need to resolve two is- 
sues. First, our solution must prescribe how to process jobs within the same group. 
Secondly, the solution needs to stipulate the timing of leaving the current job group 
and entering the next one. All existing methods address only one of the two issues. 
As we shall see, addressing these two issues simultaneously requires new ideas as well
as a nontrivial combination of existing methods.

The advantages of the efficient strategies constructed in Section 4 are three-fold.

• They address the two crucial issues described in the previous paragraph simultaneously.

• They remain optimal if we consider a constant switching cost from one project to
another.

• When the bad set (see Section 2.4 for definition) is empty, the strategy is super
efficient in the sense of attaining $o(\log N)$ regret (see Section 2.2 for definition).

If the parameter $\theta$ were known, the best policy would be to process only the job
with greatest one-step expected reward. In ignorance of $\theta$, an optimal policy needs
to trade off a reduced reward in exchange for information on $\theta$. The key to the
optimal trade-off is the construction of a strategy that achieves the asymptotic lower
bound for the shortfall from the complete information payoff, which we shall refer 
to as regret hereafter. Although dynamic programming and the Gittins index rule 
(cf. Gittins, 1989) have been developed to solve a general class of adaptive control 
problems, to which the proposed problem belongs, computational difficulty makes 
them less applicable. One reason for adopting the approach described here is to 
obtain an explicit solution which is easy to implement. 

This approach was first introduced by Lai and Robbins (1985) and generalized by 
Anantharam, Varaiya and Walrand (1987) and Lai (1987). When there is only one 
job group and the rewards from each job are independent and identically distributed 
(i.i.d.), the preceding control problem is the classical multi-armed bandit problem; see 
Robbins (1952), Berry and Fristedt (1985) and Gittins (1989). When there is only one
job in each group and rewards are i.i.d., it is the irreversible multi-armed bandit problem
studied by Hu and Wei (1989), whereas Hu and Lee (2003) considered the same
problem under a Bayesian setting. Fuh and Hu (2000) investigated the irreversible
multi-armed bandit problem with Markovian rewards. Agrawal, Teneketzis and
Anantharam (1989a, b) studied controlled i.i.d. processes and Markov chains in finite
parameter and state spaces. They introduced the concept of bad sets and showed
that it plays an important role in the solution of the adaptive control problem. Other
related works can be found in Kadane and Simon (1977), Mandelbaum and Vanderbei
(1981), Gittins (1989), Presman and Sonin (1990), Glazebrook (1991, 1996), Graves
and Lai (1997) and references therein.

The rest of the paper is organized as follows. In Section 2, we describe the com- 
ponents of a statistical model for the proposed problem. The asymptotic lower bound 
for the regret is derived in Section 3. In Section 4, we propose a class of strategies 
making use of an adjusted MLE $\hat\theta_a$. This adjustment is necessary for consistent
estimation of the bad sets of $\theta$ when the parameter space is continuous. The efficiency of
our procedure relies on an initial experimentation stage based on the adjusted MLE 
estimate to maximize the information content and also on a subsequent testing stage 
via sequential likelihood ratio tests to reject suboptimal jobs or a whole group of jobs. 
Unequal allocation of processing time on jobs may occur in the testing stage so that 
there is more frequent processing of superior jobs. In Section 5, we discuss how our 
method can be applied to multi-phase project management examples. Most of the
technical proofs are deferred to the Appendix.

2 Preliminaries 

2.1 The scheduling problem 

Let $U = J_1 + \cdots + J_I$ indicate that there are $I$ groups and $J_i$ jobs in the $i$th group for $i = 1, \ldots, I$. One is free to process any job within the same group, while jobs must be processed following the order $1, \ldots, I$ between groups. As processing a job for a unit of time is equivalent to taking an observation from a statistical population, we have $U$ statistical populations $\Pi_{11}, \ldots, \Pi_{IJ_I}$. For each $ij$, the observations from $\Pi_{ij}$ follow a Markov chain on a state space $D$ with $\sigma$-algebra $\mathcal{D}$. It is assumed that the transition probability $P_{ij}^{\theta}$ for the Markov chain has a probability density function $p_{ij}(x, y; \theta)$ with respect to some nondegenerate measure $Q$, where $p_{ij}(x, y; \cdot)$ is known and $\theta$ is an unknown parameter belonging to a parameter space $\Theta$. We assume that the stationary probability distribution for the Markov chain exists and has probability density function $\pi_{ij}(\cdot; \theta)$ with respect to $Q$. At each step, we are required to process one job respecting the partial order $ij \preceq i'j' \Leftrightarrow i \le i'$.

An adaptive policy is a rule that dictates, at each step, which job should be processed based on information from previous observations. We can represent a policy as a sequence of random variables $\phi = \{\phi_t\}$ taking values in $\{ij : i = 1, \ldots, I;\ j = 1, \ldots, J_i\}$, such that the event $\{\phi_t = ij\}$ (process job $ij$ at step $t$) belongs to the $\sigma$-field $\mathcal{F}_{t-1}$ generated by $\phi_1, X_1, \ldots, \phi_{t-1}, X_{t-1}$, where $X_n$ denotes the state of the job being processed at the $n$th step. The constraint

$$\phi_t \preceq \phi_{t+1} \quad \text{for } 1 \le t \le N-1, \tag{2.1}$$

indicates that once a sample has been taken from $\Pi_{ij}$, one can switch to other jobs within group $i$ or to the jobs in groups $i + 1$ to $I$, but no further sampling is allowed from $\Pi_{11}, \ldots, \Pi_{(i-1)J_{i-1}}$.

2.2 The objective function 

Let the initial state of the job $ij$ under processing be distributed according to $\nu_{ij}(\cdot; \theta)$. Throughout this paper, we shall use the notation $E_\theta$ ($P_\theta$) to denote expectation (probability) with respect to the initial distribution $\nu_{ij}(\cdot; \theta)$; similarly, $E_{\pi(\theta)}$ to denote expectation with respect to $P_\theta$ and the stationary distribution $\pi_{ij}(\cdot; \theta)$. We shall assume that $\mathcal{V}_{ij} = \{x \in D : \nu_{ij}(x; \theta) > 0\}$ does not depend on $\theta$ and

$$v_{ij} := \inf_{x \in \mathcal{V}_{ij}} \ \inf_{\theta, \theta' \in \Theta} \left[\nu_{ij}(x; \theta)/\nu_{ij}(x; \theta')\right] > 0 \quad \text{for all } i, j. \tag{2.2}$$

Suppose that $\int_{x \in D} |g(x)| \pi_{ij}(x; \theta) Q(dx) < \infty$ for some real-valued function (reward) $g$. Let

$$\mu_{ij}(\theta) = \int_{x \in D} g(x) \pi_{ij}(x; \theta) Q(dx)$$

be the mean reward under the stationary distribution $\pi_{ij}$ if job $ij$ is processed once. Let $N$ be the total processing time for all jobs, and

$$T_N(ij) = \sum_{t=1}^{N} 1_{\{\phi_t = ij\}} \tag{2.3}$$

be the amount of time that job $ij$ is processed, where $1$ denotes the indicator function. An optimal strategy would be one which maximizes

$$W_N(\theta) := \sum_{t=1}^{N} \sum_{i=1}^{I} \sum_{j=1}^{J_i} E_\theta\left\{E_\theta\left[g(X_t) 1_{\{\phi_t = ij\}} \mid \mathcal{F}_{t-1}\right]\right\}. \tag{2.4}$$

In the case of independent rewards, that is, when $p_{ij}(x, y; \theta) = p_{ij}(y; \theta)$ for all $i, j, x, y$ and $\theta$, $W_N(\theta) = \sum_{i=1}^{I} \sum_{j=1}^{J_i} \mu_{ij}(\theta) E_\theta T_N(ij)$. We shall show in the Appendix that for Markovian rewards, under regularity conditions A3-A4 (see Section 2.3), there exists a constant $C_0 < \infty$ independent of $\theta \in \Theta$, $N > 0$ and the strategy $\phi$ such that

$$\left| W_N(\theta) - \sum_{i=1}^{I} \sum_{j=1}^{J_i} \mu_{ij}(\theta) E_\theta T_N(ij) \right| \le C_0. \tag{2.5}$$



When the parameter space $\Theta$ and state space $D$ are both finite, (2.5) also follows from Anantharam, Varaiya and Walrand (1987, Lemma 2.1). In light of (2.5), maximizing $W_N(\theta)$ is asymptotically equivalent [up to an $O(1)$ term] to minimizing the regret

$$R_N(\theta) := N\mu^*(\theta) - \sum_{i=1}^{I} \sum_{j=1}^{J_i} \mu_{ij}(\theta) E_\theta T_N(ij) = \sum_{ij:\, \mu_{ij}(\theta) < \mu^*(\theta)} \left[\mu^*(\theta) - \mu_{ij}(\theta)\right] E_\theta T_N(ij), \tag{2.6}$$

where $\mu^*(\theta) := \max_{1 \le i \le I} \max_{1 \le j \le J_i} \mu_{ij}(\theta)$.
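As a small illustration of the two equivalent forms of the regret in (2.6), the following sketch evaluates both for a hypothetical configuration. The mean rewards and expected processing times are made-up numbers chosen for illustration, not values from the paper, and the function name `regret` is an assumption.

```python
def regret(mu, expected_counts):
    """Second form of (2.6): sum over jobs of (mu* - mu_ij) * E[T_N(ij)].

    Both arguments are dicts keyed by (group, job) pairs; optimal jobs
    contribute zero, so summing over all jobs is harmless.
    """
    mu_star = max(mu.values())
    return sum((mu_star - mu[ij]) * expected_counts[ij] for ij in mu)


# Hypothetical two-group example in which job (1, 1) is optimal.
mu = {(1, 1): 0.5, (1, 2): 0.3, (2, 1): 0.4}
counts = {(1, 1): 90.0, (1, 2): 6.0, (2, 1): 4.0}
N = sum(counts.values())

r = regret(mu, counts)
# First form of (2.6): N * mu* - sum of mu_ij * E[T_N(ij)].
r_alt = N * max(mu.values()) - sum(mu[ij] * counts[ij] for ij in mu)
```

The two forms agree because the expected processing times sum to $N$.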
Because adaptive strategies $\phi$ which are optimal for all $\theta \in \Theta$ and large $N$ in general do not exist, we consider the class of all (asymptotically) uniformly good adaptive strategies under the partial order constraint $\preceq$, with regret satisfying

$$R_N(\theta) = o(N^{\alpha}) \quad \text{for all } \alpha > 0 \text{ and } \theta \in \Theta. \tag{2.7}$$

Such strategies have regret that does not increase too rapidly for any $\theta \in \Theta$. We would like to find a strategy that minimizes the rate of increase of the regret within the class of uniformly good adaptive strategies under the partial order constraint $\preceq$.

Due to the irreversibility constraint (2.1), a strategy satisfying (2.7) would in general be dependent on $N$ when there is more than one group of arms. Consider for example the case in which the optimal arm is unique and lies in the first group. Let $p > 0$ be the probability that a strategy $\phi$ bypasses the first group of arms before a fixed time $N_0$. If the strategy $\phi$ is independent of $N$, then

$$R_N(\theta) \ge p\,(N - N_0)\left[\mu^*(\theta) - \max_{2 \le i \le I} \max_{1 \le j \le J_i} \mu_{ij}(\theta)\right]$$

and (2.7) does not hold. This is unlike the case of the one-group multi-armed bandit considered by Lai and Robbins (1985), Anantharam et al. (1987) and Agrawal et al. (1989a,b), for which optimal strategies $\phi$ satisfying (2.7) and not dependent on $N$ have been constructed.

2.3 The assumptions 

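Denote the Kullback-Leibler information number by

$$I_{ij}(\theta, \theta') = \int_{x \in D} \int_{y \in D} \log\left[\frac{p_{ij}(x, y; \theta)}{p_{ij}(x, y; \theta')}\right] p_{ij}(x, y; \theta)\, \pi_{ij}(x; \theta)\, Q(dy)\, Q(dx). \tag{2.8}$$

Then, $0 \le I_{ij}(\theta, \theta') \le \infty$. We shall assume that $I_{ij}(\theta, \theta') < \infty$ for all $i, j$ and $\theta, \theta' \in \Theta$. Let $\mu_i(\theta) = \max_{1 \le j \le J_i} \mu_{ij}(\theta)$ be the largest reward in the $i$th group of jobs, and

$$\Theta_i = \{\theta \in \Theta : \mu_i(\theta) > \mu_{i'}(\theta) \text{ for all } i' < i \text{ and } \mu_i(\theta) \ge \mu_{i'}(\theta) \text{ for all } i' > i\} \tag{2.9}$$

be the set of parameter values such that the first optimal job is in group $i$. Let

$$\Theta_{ij} = \{\theta \in \Theta_i : \mu_{ij}(\theta) = \mu_i(\theta)\} \tag{2.10}$$

be the parameter set such that job $ij$ is one of the first optimal jobs. Each $\theta \in \Theta$ belongs to exactly one $\Theta_i$ but may belong to more than one $\Theta_{ij}$. Let

$$\Theta_i^* = \{\theta \in \Theta : \mu_i(\theta) > \mu_{i'}(\theta) \text{ for all } i' \ne i\} \tag{2.11}$$

be the parameter set in which all the optimal arms lie in group $i$. Clearly, $\Theta_i^* \subset \Theta_i$ but the reverse relation is not necessarily true.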
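The definitions (2.9)-(2.11) can be checked mechanically once the mean rewards $\mu_{ij}(\theta)$ are available. The sketch below is only an illustration: the function name `classify`, the nested-list input format, and the numerical tolerance are assumptions introduced here.

```python
def classify(mu_groups, tol=1e-12):
    """Locate theta relative to the sets in (2.9)-(2.11).

    mu_groups[i][j] holds the mean reward of job j in group i (0-based input).
    Returns (i, jobs, strict) with 1-based indices, where
      i      : first group attaining the overall maximum (theta in Theta_i),
      jobs   : jobs j in that group with mu_ij = mu_i (theta in Theta_ij),
      strict : True if no other group attains the maximum (theta in Theta_i^*).
    """
    group_max = [max(g) for g in mu_groups]
    best = max(group_max)
    attaining = [i for i, m in enumerate(group_max) if best - m <= tol]
    i = attaining[0]
    jobs = [j + 1 for j, m in enumerate(mu_groups[i]) if best - m <= tol]
    return i + 1, jobs, len(attaining) == 1


# Groups 1 and 2 tie for the maximum: theta is in Theta_1 but not in Theta_1^*.
tie = classify([[0.3, 0.5], [0.5, 0.2]])
# Group 2 is strictly best and both of its jobs are optimal.
strict = classify([[0.1], [0.4, 0.4]])
```

The first call illustrates why $\Theta_i^* \subset \Theta_i$ can be a strict inclusion: a tie across groups is resolved in favour of the earlier group.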

We now state a set of assumptions that will be used to prove the optimality results in Sections 3 and 4. Let $\Theta$ be a compact subset of $\mathbb{R}^d$ for some $d \ge 1$ and let $X_{ijt}$ denote the $t$th observation taken from arm $ij$.

A1. $\mu_{ij}(\cdot)$ are finite and continuous on $\Theta$ for all $i, j$. Moreover, no job group is redundant in the sense that $\Theta_i^* \ne \emptyset$ for all $i = 1, \ldots, I$.

A2. $\sum_{j=1}^{J_1} I_{1j}(\theta, \theta') > 0$ for all $\theta' \ne \theta$ and $\inf_{\theta' \in \Theta_{ij}} I_{ij}(\theta, \theta') > 0$ for all $1 \le i < I$, $1 \le j \le J_i$ and $\theta \in \cup_{\ell > i} \Theta_\ell$.

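A3. For each $j = 1, \ldots, J_i$, $i = 1, \ldots, I$ and $\theta \in \Theta$, $\{X_{ijt}, t \ge 0\}$ is a Markov chain on a state space $D$ with $\sigma$-algebra $\mathcal{D}$, irreducible with respect to a maximal irreducible measure on $(D, \mathcal{D})$ and aperiodic. Furthermore, $X_{ijt}$ is Harris recurrent in the sense that there exist a set $G_{ij} \in \mathcal{D}$, $\alpha_{ij} > 0$ and a probability measure $\varphi_{ij}$ such that $P_{ij}^{\theta}\{X_{ijt} \in G_{ij} \text{ i.o.} \mid X_{ij0} = x\} = 1$ for all $x \in D$ and

$$P_{ij}^{\theta}\{X_{ij1} \in A \mid X_{ij0} = x\} \ge \alpha_{ij}\, \varphi_{ij}(A) \quad \text{for all } x \in G_{ij} \text{ and } A \in \mathcal{D}. \tag{2.12}$$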

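A4. There exist constants $0 < b < 1$, $\bar{b} > 0$ and drift functions $V_{ij} : D \to [1, \infty)$ such that for all $i = 1, \ldots, I$ and $j = 1, \ldots, J_i$,

$$\sup_{x \in D} |g(x)|/V_{ij}(x) < \infty, \tag{2.13}$$

and for all $x \in D$ and $\theta \in \Theta$,

$$P_{ij}^{\theta} V_{ij}(x) \le (1 - b) V_{ij}(x) + \bar{b}\, 1_{\{x \in G_{ij}\}}, \tag{2.14}$$

where $G_{ij}$ satisfies (2.12) and $P_{ij}^{\theta} V_{ij}(x) = \int_D V_{ij}(y) P_{ij}^{\theta}(x, dy)$. Moreover, we require that

$$\int_D V_{ij}(x)\, \nu_{ij}(x; \theta)\, Q(dx) < \infty \quad \text{and} \quad V_{ij}^* := \sup_{x \in G_{ij}} V_{ij}(x) < \infty. \tag{2.15}$$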
Let $\ell_{ij}(x, y; \theta, \theta') = \log[p_{ij}(x, y; \theta)/p_{ij}(x, y; \theta')]$ be the log likelihood ratio between $P_{ij}^{\theta}$ and $P_{ij}^{\theta'}$ and $N_\delta(\theta) = \{\theta' : \|\theta - \theta'\| < \delta\}$ a ball of radius $\delta$ around $\theta$, where $\|\cdot\|$ denotes the Euclidean norm.
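The Kullback-Leibler number (2.8) is the stationary expectation of the one-step log likelihood ratio $\ell_{ij}$. A minimal sketch for a two-state chain follows, assuming (purely for illustration) that the parameter is the transition matrix itself with all entries strictly positive; the function names are hypothetical.

```python
import math


def stationary(P):
    """Stationary distribution of a two-state chain with transition matrix P."""
    a, b = P[0][1], P[1][0]  # probabilities of leaving state 0 and state 1
    return [b / (a + b), a / (a + b)]


def log_lr(P, P_alt, x, y):
    """One-step log likelihood ratio l(x, y; theta, theta') for a move x -> y."""
    return math.log(P[x][y] / P_alt[x][y])


def kl_information(P, P_alt):
    """Discrete analogue of (2.8): sum_x pi(x) sum_y p(x, y) l(x, y)."""
    pi = stationary(P)
    return sum(pi[x] * P[x][y] * log_lr(P, P_alt, x, y)
               for x in range(2) for y in range(2))


P = [[0.9, 0.1], [0.2, 0.8]]       # chain under theta (illustrative values)
P_alt = [[0.5, 0.5], [0.5, 0.5]]   # chain under theta' (illustrative values)
info = kl_information(P, P_alt)
```

The information number vanishes exactly when the two transition laws coincide on the states visited under $\theta$, which is the situation that later gives rise to bad sets.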

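A5. There exists $\delta > 0$ such that for all $\theta, \theta' \in \Theta$,

$$K_{\theta, \theta'} := \sup_{x \in D} \frac{E_\theta\left[\sup_{\tilde\theta \in N_\delta(\theta')} \ell_{ij}^2(X_{ij0}, X_{ij1}; \theta, \tilde\theta) \,\middle|\, X_{ij0} = x\right]}{V_{ij}(x)} < \infty \tag{2.16}$$

for all $j = 1, \ldots, J_i$, $i = 1, \ldots, I$. Moreover,

$$\sup_{\tilde\theta \in N_{\delta'}(\theta')} |\ell_{ij}(x, y; \theta', \tilde\theta)| \to 0 \quad \text{as } \delta' \to 0 \tag{2.17}$$

for all $x, y \in D$ and $\theta' \in \Theta$.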

Assumption A1 excludes some unrealistic models in which efficient but impractical strategies may exist. A2 is a positive information criterion: the first inequality makes sure that information is available in the first job group to estimate $\theta$, while the second inequality allows us to gather information in the $i$th job group for moving to the next group when $\theta \in \Theta_\ell$ for some $\ell > i$. Assumption A3 is a recurrence condition and A4 is a drift condition. These two conditions are used to guarantee the stability of the Markov chain so that the strong law of large numbers and Wald's equation hold. A5 is a finite second moment condition that allows us to bound the probability that the MLE of $\theta$ lies outside a small neighborhood of $\theta$. This bound is important for determining the level of unequal allocation of observations that can be permitted in the testing stage of our procedure. The proof of the asymptotic lower bound in Theorem 1 requires only A1-A3, while A4 and A5 are additionally required for the construction of efficient strategies attaining the lower bound.

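We now demonstrate an immediate consequence of A3-A5: for any $\theta \in \Theta$ and $\epsilon > 0$, there exists $0 < \delta' < \delta$ such that

$$E_{\pi_{ij}(\theta)}\left[\sup_{\tilde\theta \in N_{\delta'}(\theta')} |\ell_{ij}(X_{ij0}, X_{ij1}; \theta', \tilde\theta)|\right] < \epsilon \tag{2.18}$$

for all $ij$ and $\theta' \in \Theta$. Note that the continuity of $I_{ij}(\theta, \cdot)$ follows from (2.18).

Since $\pi_{ij} = C \sum_{k=0}^{\infty} (P_{ij}^{\theta} - \alpha_{ij} \varphi_{ij} 1_{G_{ij}})^k \varphi_{ij}$, where $C$ is a normalizing constant, it follows from (2.14)-(2.15) that $\int_D V_{ij}(x) \pi_{ij}(x; \theta) Q(dx) < \infty$. Hence by (2.16) and the relation $\ell_{ij}(X_{ij0}, X_{ij1}; \theta', \tilde\theta) = \ell_{ij}(X_{ij0}, X_{ij1}; \theta, \tilde\theta) - \ell_{ij}(X_{ij0}, X_{ij1}; \theta, \theta')$, we have

$$E_{\pi_{ij}(\theta)}\left[\sup_{\tilde\theta \in N_\delta(\theta')} |\ell_{ij}(X_{ij0}, X_{ij1}; \theta', \tilde\theta)|\right] \le E_{\pi_{ij}(\theta)} |\ell_{ij}(X_{ij0}, X_{ij1}; \theta, \theta')| + E_{\pi_{ij}(\theta)}\left[\sup_{\tilde\theta \in N_\delta(\theta')} |\ell_{ij}(X_{ij0}, X_{ij1}; \theta, \tilde\theta)|\right] < \infty.$$

As the convergence in (2.17) is monotone decreasing, it follows from the dominated convergence theorem that (2.18) holds.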

2.4 Bad sets 

The bad set is a useful concept for understanding the learning required within the group containing optimal jobs. It is associated with the asymptotic lower bound described in Section 3 and is used explicitly in Section 4 to construct the asymptotically efficient strategy. For $\theta \in \Theta_\ell$, define $J(\theta) = \{j : \mu^*(\theta) = \mu_{\ell j}(\theta)\}$ as the set of optimal jobs in group $\ell$. Hence $\theta \in \Theta_{\ell j}$ if and only if $j \in J(\theta)$. We also define the bad set, the set of 'bad' parameter values associated with $\theta$, as all $\theta' \in \Theta_\ell$ which cannot be distinguished from $\theta$ by processing any of the optimal jobs $\ell j$. More specifically, the bad set is

$$B_\ell(\theta) = \Big\{\theta' \in \Theta_\ell \setminus \Big(\bigcup_{j \in J(\theta)} \Theta_{\ell j}\Big) : I_{\ell j}(\theta, \theta') = 0 \text{ for all } j \in J(\theta)\Big\}. \tag{2.19}$$

We note that if $I_{ij}(\theta, \theta') = 0$, then the transition probabilities of $X_{ijt}$ are identical under both $\theta$ and $\theta'$. If $\theta' \in B_\ell(\theta)$, then by definition, $\theta' \notin \cup_{j \in J(\theta)} \Theta_{\ell j}$ and hence $J(\theta') \cap J(\theta) = \emptyset$. Let $j \in J(\theta)$ and $j' \in J(\theta')$. Then $\mu_{\ell j'}(\theta') > \mu_{\ell j}(\theta') = \mu_{\ell j}(\theta) > \mu_{\ell j'}(\theta)$. Thus

$$I_{\ell j'}(\theta, \theta') > 0 \quad \text{for all } \theta' \in B_\ell(\theta) \text{ and } j' \in J(\theta'). \tag{2.20}$$

The interpretation of (2.20) is as follows. Although we cannot distinguish $\theta$ from $\theta' \in B_\ell(\theta)$ when processing the optimal job for $\theta$, we can distinguish them by processing the optimal job for $\theta'$. This fact explains the necessity of processing non-optimal jobs to collect information.

Assumption A2 says that, when sampling from the optimal arm, one can distinguish any $\theta$ value whose optimal arm is in a future group. Having a non-empty bad set, on the other hand, says that sampling from the optimal arm cannot distinguish some $\theta$ value whose optimal arm is in the current group. These two statements are compatible. We now provide two examples from the celebrated multi-armed bandit problem to illustrate the idea of bad sets.

Example 1: Independent armed-bandit problem. Let $\Pi_{11}, \ldots, \Pi_{1J}$ denote $J$ statistical populations specified, respectively, by density functions $p(x; \theta_j)$ with respect to some measure $Q$. For simplicity, assume that $x = 0, 1$ and $p(0; \theta_j) = 1 - \theta_j$, $p(1; \theta_j) = \theta_j$, where $\theta_j$ are unknown parameters taking values in $[0, 1]$. A multi-armed bandit problem searches for strategies to sample $X_1, X_2, \ldots$ sequentially from these $J$ populations in order to maximize the expected value of the sum $S_N = \sum_{t=1}^{N} X_t$ as $N \to \infty$.

Let $\theta = (\theta_1, \ldots, \theta_J)$. If $\theta = (0.2, 0.1)$, then the set of optimal arms is $J(\theta) = \{1\}$ and the bad set is $B_1(\theta) = \{(0.2, \theta_2) : 0.2 < \theta_2 \le 1\}$. Even though arm 1 is optimal, experimentation from arm 2 is required to make sure that the true parameter value does not lie in $B_1(\theta)$.
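Membership in the bad set of Example 1 can be tested directly from definition (2.19). The sketch below assumes independent Bernoulli arms in a single group, for which the information number of an arm is zero exactly when its success probability is unchanged; arm indices are 0-based in the code, and the function names are illustrative.

```python
def optimal_arms(theta, tol=1e-12):
    """Indices (0-based) of the arms with the largest success probability."""
    best = max(theta)
    return {j for j, t in enumerate(theta) if best - t <= tol}


def in_bad_set(theta, theta_alt, tol=1e-12):
    """True if theta_alt lies in the bad set B_1(theta) of (2.19)."""
    J = optimal_arms(theta, tol)
    # theta_alt must not share an optimal arm with theta ...
    if J & optimal_arms(theta_alt, tol):
        return False
    # ... and must be indistinguishable from theta on every arm in J(theta).
    return all(abs(theta[j] - theta_alt[j]) <= tol for j in J)


theta = (0.2, 0.1)
checks = [
    in_bad_set(theta, (0.2, 0.5)),   # arm 2 better, arm 1 unchanged
    in_bad_set(theta, (0.2, 0.15)),  # arm 1 still optimal
    in_bad_set(theta, (0.3, 0.5)),   # arm 1 itself reveals the difference
    in_bad_set(theta, (0.2, 0.2)),   # tie: arm 1 remains optimal
]
```

Only the first alternative belongs to $B_1(\theta)$, in agreement with the description $\{(0.2, \theta_2) : 0.2 < \theta_2 \le 1\}$.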

The two-armed bandit problem studied by Feldman (1962) has $\Theta = \{(\theta_1, \theta_2), (\theta_2, \theta_1)\}$ with $\theta_1 \ne \theta_2$. It follows that $B_1(\theta) = \emptyset$ for all $\theta \in \Theta$. This leads to remarkably low regret, $R_N(\theta) = O(1)$.

Example 2: Correlated armed-bandit problem. Consider bivariate normal populations $\Pi_{11}, \Pi_{12}, \Pi_{13}$ with respective mean vectors $(\mu_1, \lambda)$, $(\mu_2, \mu_3)$ and $(\mu_3, \mu_2 + \lambda)$, where $\mu_1, \mu_2, \mu_3, \lambda$ are unknown parameters. The problem is to sample the random vectors sequentially to maximize the expected value of the first component of the observed sum, $\sum_{t=1}^{N} X_t$, as $N \to \infty$. Let $\theta = (\mu_1, \mu_2, \mu_3, \lambda)$. If $J(\theta) = \{1\}$, then

$$B_1(\theta) = \{\theta' \in \Theta : \mu_1 = \mu_1',\ \lambda = \lambda',\ \max(\mu_2', \mu_3') > \mu_1'\}.$$

3 A lower bound for the regret 

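The following theorem gives an asymptotic lower bound for the regret (2.6) of uniformly good adaptive strategies under the partial order constraint $\preceq$.

Theorem 1 Assume A1-A3 and let $\theta \in \Theta_\ell$. For any uniformly good adaptive strategy $\phi$ under the partial order constraint $\preceq$,

$$\liminf_{N \to \infty} R_N(\theta)/\log N \ge z(\theta, \ell), \tag{3.1}$$

where $z(\theta, \ell)$ is the solution of the following minimization problem (Problem A):

$$\text{Minimize } \sum_{i < \ell} \sum_{j=1}^{J_i} [\mu^*(\theta) - \mu_{ij}(\theta)]\, z_{ij}(\theta) + \sum_{j \notin J(\theta)} [\mu^*(\theta) - \mu_{\ell j}(\theta)]\, z_{\ell j}(\theta), \tag{3.2}$$

subject to $z_{ij}(\theta) \ge 0$, $j = 1, \ldots, J_i$ if $i < \ell$; $j \notin J(\theta)$ if $i = \ell$; and

$$\begin{cases}
\inf_{\theta' \in \Theta_1} \Big\{\sum_{j=1}^{J_1} I_{1j}(\theta, \theta')\, z_{1j}(\theta)\Big\} \ge 1, \\
\inf_{\theta' \in \Theta_2} \Big\{\sum_{j=1}^{J_1} I_{1j}(\theta, \theta')\, z_{1j}(\theta) + \sum_{j=1}^{J_2} I_{2j}(\theta, \theta')\, z_{2j}(\theta)\Big\} \ge 1, \\
\qquad \vdots \\
\inf_{\theta' \in \Theta_{\ell-1}} \Big\{\sum_{j=1}^{J_1} I_{1j}(\theta, \theta')\, z_{1j}(\theta) + \cdots + \sum_{j=1}^{J_{\ell-1}} I_{(\ell-1)j}(\theta, \theta')\, z_{(\ell-1)j}(\theta)\Big\} \ge 1, \\
\inf_{\theta' \in B_\ell(\theta)} \Big\{\sum_{i < \ell} \sum_{j=1}^{J_i} I_{ij}(\theta, \theta')\, z_{ij}(\theta) + \sum_{j \notin J(\theta)} I_{\ell j}(\theta, \theta')\, z_{\ell j}(\theta)\Big\} \ge 1.
\end{cases} \tag{3.3}$$

The first $(\ell - 1)$ inequalities in (3.3) are due to the partial order constraints. When there is no partial order constraint and the jobs are independent, the solution of Problem A reduces to the lower bound given in Theorem 1 of Lai and Robbins (1985).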
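For orientation, the single-group, independent-arm special case can be computed in closed form. The sketch below assumes Bernoulli arms with unrestricted parameters in $(0, 1)$, in which case the Lai and Robbins (1985) constant is $\sum_{j:\, \theta_j < \theta^*} (\theta^* - \theta_j)/I(\theta_j, \theta^*)$ with $\theta^* = \max_j \theta_j$. The function names are illustrative and the code does not attempt to solve the general Problem A.

```python
import math


def bernoulli_kl(p, q):
    """Kullback-Leibler information between Bernoulli(p) and Bernoulli(q).

    Assumes 0 < q < 1; terms with zero probability under p contribute zero.
    """
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def lai_robbins_constant(theta):
    """Sum over suboptimal arms of (theta* - theta_j) / I(theta_j, theta*)."""
    best = max(theta)
    return sum((best - t) / bernoulli_kl(t, best) for t in theta if t < best)


c = lai_robbins_constant((0.5, 0.25))
```

In this special case each suboptimal arm $j$ receives $z_j = 1/I(\theta_j, \theta^*)$, so the regret grows like the computed constant times $\log N$.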

Under the assumptions of Theorem 1, the strategies that satisfy, for $\theta \in \Theta_\ell$,

$$\lim_{N \to \infty} R_N(\theta)/\log N = z(\theta, \ell), \tag{3.4}$$

are said to be asymptotically efficient. If $B_\ell(\theta) = \emptyset$, then the last inequality of (3.3) is removed. In particular, when $\theta \in \Theta_1$, (3.4) implies that

$$R_N(\theta) = \begin{cases} O(\log N) & \text{if } B_1(\theta) \ne \emptyset, \\ o(\log N) & \text{if } B_1(\theta) = \emptyset. \end{cases} \tag{3.5}$$

We shall assume that $B_\ell(\theta)$ is non-empty for the underlying $\theta \in \Theta_\ell$, which is true for most applications. The case of $B_\ell(\theta) = \emptyset$ will be treated elsewhere.

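The following lemma will be used to prove Theorem 1. The proofs of both Lemma 1 and Theorem 1 will be given in the Appendix.

Lemma 1 Assume A2-A3. Let $\phi$ be a uniformly good adaptive strategy under the partial order constraint $\preceq$. If $\theta \in \Theta_\ell$, then for every $\theta' \in \Theta_k$, $k < \ell$,

$$\liminf_{N \to \infty} \Big\{\sum_{i=1}^{k} \sum_{j=1}^{J_i} I_{ij}(\theta, \theta')\, E_\theta T_N(ij)\Big\}\Big/\log N \ge 1, \tag{3.6}$$

and for every $\theta' \in B_\ell(\theta)$,

$$\liminf_{N \to \infty} \Big\{\sum_{i < \ell} \sum_{j=1}^{J_i} I_{ij}(\theta, \theta')\, E_\theta T_N(ij) + \sum_{j \notin J(\theta)} I_{\ell j}(\theta, \theta')\, E_\theta T_N(\ell j)\Big\}\Big/\log N \ge 1. \tag{3.7}$$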
4 Construction of asymptotically efficient strategies



4.1 Outline of the construction 

The goal of any reasonable strategy is to determine whether the job currently under processing is optimal or not based on sequential observations. The job under processing, say job $ij$, is optimal if $\theta \in \Theta_{ij}$. Thus, the problem of constructing an efficient adaptive strategy reduces to that of finding a procedure to determine whether $\theta \in \Theta_{ij}$ is true or not based on a sequential sample. The asymptotic lower bound discussed in Section 3 gives us valuable information about the size of the sequential sample. In particular, it suggests that for $\theta \in \Theta_\ell$, the amount of processing time for job $ij$, $j = 1, \ldots, J_i$ if $i < \ell$, and $j \notin J(\theta)$ if $i = \ell$, should be $[z_{ij}(\theta) + o(1)] \log N$, where $z_{ij}(\theta)$ solves the minimization problem (3.2).

In view of Theorem 1, the sample size $[z_{ij}(\theta) + o(1)] \log N$ represents the minimum amount of learning about job $ij$ in order for the strategy to be uniformly good. Because of the partial order constraint $\preceq$, we also need a sequential test to ensure that the optimal job is passed over with probability not exceeding $N^{-1}$. These two facts are important guidelines for the construction of asymptotically efficient strategies so that the two crucial issues mentioned in the abstract and Section 1 can be addressed.

Let $n_0, n_1$ be positive integers that increase to infinity with $N$ such that $n_0 = o(\log N)$ and $n_1 = o(n_0)$. We shall now describe the asymptotically efficient strategy $\phi^*$ by dividing it into three distinct stages: estimation, experimentation and testing.
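Many sequences satisfy the growth conditions on $n_0$ and $n_1$. One admissible choice, given purely as an illustration and not prescribed by the paper, is $n_0 = \lceil (\log N)^{1/2} \rceil$ and $n_1 = \lceil (\log N)^{1/4} \rceil$; both diverge, $n_0/\log N \to 0$ and $n_1/n_0 \to 0$.

```python
import math


def stage_sizes(N):
    """Hypothetical choice of (n0, n1) with n0 = o(log N) and n1 = o(n0).

    Assumes N > 1 so that log N is positive.
    """
    log_n = math.log(N)
    n0 = math.ceil(log_n ** 0.5)   # estimation sample size per job
    n1 = math.ceil(log_n ** 0.25)  # block length for favoured jobs in testing
    return n0, n1


small = stage_sizes(10 ** 6)
large = stage_sizes(10 ** 100)
```

The slow growth shows that these conditions are asymptotic: for realistic horizons both quantities are small integers, so finite-sample tuning is a separate practical matter.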

In the estimation stage, $n_0 = o(\log N)$ observations are taken from each job in group 1 for estimating the parameter $\theta \in \Theta_\ell$. If $\ell > 1$, or $\ell = 1$ and $B_1(\theta) \ne \emptyset$, then an order of $\log N$ observations are taken in the experimentation stage, which contribute $[z(\theta, \ell) + o(1)] \log N$ to the regret; see (3.1). Finally, in the testing stage, $o(\log N)$ observations are taken from each of the suboptimal jobs. We first consider the optimal strategy for the case of finite $\Theta$, which captures the essential ingredients without too much technical detail. We then extend the strategy to infinite $\Theta$, followed by a formal statement of optimality in Theorem 2.
4.2 Optimal strategy for finite $\Theta$

1. Estimation. Take an initial sample of $n_0$ observations from each job in group 1. Let $\hat\theta$ be the maximum likelihood estimate (MLE) of $\theta$ defined by

$$L(\theta) = \sum_{j=1}^{J_1} \sum_{t=1}^{n_0} \log p_{1j}(X_{1j(t-1)}, X_{1jt}; \theta), \qquad \hat\theta = \arg\max_{\theta \in \Theta} L(\theta). \tag{4.1}$$

Let $k = 1$.

2. Experimentation. Let $\lfloor \cdot \rfloor$ denote the greatest integer function.

(a) If $\hat\theta \in \cup_{i > k} \Theta_i$: Take $\lfloor z_{kj}(\hat\theta) \log N \rfloor$ observations from job $kj$ for $j = 1, \ldots, J_k$.

(b) If $\hat\theta \in \Theta_k$: Take $\lfloor z_{kj}(\hat\theta) \log N \rfloor$ observations from job $kj$ for $j \notin J(\hat\theta)$.

(c) If $\hat\theta \in \cup_{i < k} \Theta_i$: Skip the experimentation phase.

3. Testing. Start with a full set $\{k1, \ldots, kJ_k\}$ of unrejected jobs. Let $n = (n_{11}, \ldots, n_{kJ_k})$, where $n_{ij}$ denotes the number of observations taken from arm $ij$ so far. The rejection of a job is based on the following test statistic. Let $F_k$, $1 \le k \le I$, be a probability distribution with positive probability on all open subsets of $\cup_{i \ge k} \Theta_i$. Define

$$U_k(n; \lambda) = \frac{\int \prod_{i=1}^{k} \prod_{j=1}^{J_i} \nu_{ij}(X_{ij0}; \theta) \prod_{t=1}^{n_{ij}} p_{ij}(X_{ij(t-1)}, X_{ijt}; \theta)\, dF_k(\theta)}{\prod_{i=1}^{k} \prod_{j=1}^{J_i} \nu_{ij}(X_{ij0}; \lambda) \prod_{t=1}^{n_{ij}} p_{ij}(X_{ij(t-1)}, X_{ijt}; \lambda)} \tag{4.2}$$

for all $\lambda \in \Theta_k$.

(a) If $\hat\theta \in \cup_{i > k} \Theta_i$: Add one observation from each unrejected job. Reject parameter $\lambda$ if $U_k(n; \lambda) > N$. Reject a job $kj$ if all $\lambda \in \Theta_{kj}$ have been rejected at some point in the testing stage. If there is a job in group $k$ left unrejected and the total number of observations is less than $N$, repeat 3(a). Otherwise go to step 4.

(b) If $\hat\theta \in \Theta_k$: Add $n_1$ observations from each unrejected job $kj$, $j \in J(\hat\theta)$, and one observation from each unrejected job $kj$, $j \notin J(\hat\theta)$. Reject a job $kj$ if all $\lambda \in \Theta_{kj}$ have been rejected at some point in the testing stage. If there is a job in group $k$ left unrejected and the total number of observations is less than $N$, repeat 3(b). Otherwise, go to step 4.

(c) If $\hat\theta \in \cup_{i < k} \Theta_i$: Adopt the procedure of 3(a).

4. Moving to the next group and termination. The strategy terminates once $N$ observations have been collected. Otherwise, if $k < I$, increment $k$ by 1 and go to step 2; if $k = I$, select all remaining observations from a job $Ij$ satisfying $\mu_{Ij}(\hat\theta) = \mu_I(\hat\theta)$.
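The estimation and testing steps can be illustrated on a deliberately simplified model. The sketch below assumes a single group of independent Bernoulli arms (so there is no initial density term and the likelihood depends only on success counts), a finite parameter set, and a discrete mixing distribution $F$ given by `support` and `weights`. All names and numbers are hypothetical and the code is not the full strategy $\phi^*$.

```python
import math


def log_lik(theta, successes, trials):
    """Log likelihood of independent Bernoulli arms; entries of theta in (0, 1)."""
    return sum(s * math.log(p) + (n - s) * math.log(1.0 - p)
               for p, s, n in zip(theta, successes, trials))


def mle(Theta, successes, trials):
    """Step 1: maximise the log likelihood over a finite parameter set."""
    return max(Theta, key=lambda th: log_lik(th, successes, trials))


def mixture_lr(lam, support, weights, successes, trials):
    """Step 3: mixture likelihood under F divided by the likelihood under lam."""
    base = log_lik(lam, successes, trials)
    return sum(w * math.exp(log_lik(th, successes, trials) - base)
               for th, w in zip(support, weights))


Theta = [(0.8, 0.2), (0.2, 0.8)]
successes, trials = (16, 4), (20, 20)   # made-up data favouring (0.8, 0.2)
N = 1000

theta_hat = mle(Theta, successes, trials)
u_wrong = mixture_lr((0.2, 0.8), Theta, [0.5, 0.5], successes, trials)
u_true = mixture_lr((0.8, 0.2), Theta, [0.5, 0.5], successes, trials)
reject_wrong = u_wrong > N   # the mismatched parameter is rejected
reject_true = u_true > N     # the well-supported parameter is retained
```

Working with log likelihoods before exponentiating the difference keeps the ratio numerically stable when the number of observations is large.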
We shall now describe how each feature of the proposed strategy leads to asymptotic optimality in Theorem 2. The positive information assumption in the first half of A2 allows us to estimate $\theta$ consistently and hence enables us to determine the optimal sample size $z_{1j}(\theta)$ in the experimental stage of group 1. The assumption is important because once we move to the next group of jobs, irreversibility would prevent us from making up any shortfall in the optimal sample size required from group 1. By selecting $n_0 \to \infty$, we ensure the consistency of $\hat\theta$, while by choosing $n_0 = o(\log N)$, the estimation of $\hat\theta$ incurs a negligible contribution to the regret.

Let $k$ be the current group of jobs under sampling. Consider first $\hat\theta \in \Theta_\ell$ for some $\ell > k$. We are instructed to select $\lfloor z_{kj}(\hat\theta) \log N \rfloor$ observations from each job in the experimental stage. By Theorem 1 and the consistency of $\hat\theta$, this is optimal for learning. If $\hat\theta \in \Theta_\ell$ for some $\ell < k$, then the estimate $\hat\theta$ says that we have overshot the optimal group, so the estimate $\hat\theta$ cannot be trusted. In both cases, our strategy then is to rely on the testing stage to decide if we should stay within the current job group.

The testing stage is important in stopping us from moving beyond the first group of optimal jobs. The rationale is that by irreversibility, the penalty for moving beyond the first group of optimal jobs can be of order $N$, which is large compared to the desired regret of $O(\log N)$. The usefulness of the testing stage in this respect can be seen from (4.6) below, which guarantees that the regret due to overshooting the optimal job group is $O(1)$. The positive information assumption in the second half of A2 is necessary for the testing stage to be successful.

Let us now consider the strategy in 3(b). If $\hat\theta = \theta$, then $\lfloor z_{kj}(\theta) \log N \rfloor$ observations from arm $kj$ are taken in the experimental stage and hence by the last inequality of (3.3), $o(\log N)$ observations from jobs with positive information are needed to reject $\lambda \in B_\ell(\theta)$ in the testing stage, but we may still need an order of $\log N$ observations to reject $\lambda \in \Theta_\ell \setminus B_\ell(\theta)$. Since we would like $o(\log N)$ observations from suboptimal jobs in the testing stage, sampling equally from all jobs would be undesirable here. We consider instead the selection of $n_1$ observations from job $kj$, $j \in J(\hat\theta)$, for each observation from the other jobs, where $n_1$ goes to infinity with $N$, so that $O(n_1^{-1} \log N) = o(\log N)$ observations are taken from suboptimal jobs when $\hat\theta = \theta$. When $\hat\theta \ne \theta$, it might be possible that each job $kj$, $j \in J(\hat\theta)$, would provide no information to reject some $\lambda \in \Theta_k \setminus \cup_{j \in J(\hat\theta)} \Theta_{kj}$. Our procedure would then allocate $O(n_1 \log N)$ observations from suboptimal jobs in the testing stage conditional on this happening. By A5 and Chebyshev's inequality, the probability of providing an incorrect estimate of $\theta$ is $O(n_0^{-1})$ and hence by specifying $n_1 = o(n_0)$, we ensure that the average contribution from suboptimal jobs is $O(n_0^{-1} n_1 \log N) = o(\log N)$.

The final case $\hat\theta \in \cup_{i < k} \Theta_i$ occurs with $o(1)$ probability, which together with the $O(\log N)$ observations taken in the non-optimal jobs in the testing stage when this happens, results in an overall $o(\log N)$ contribution to the regret.

The last step is to proceed to the next group of jobs when all parameters in $\Theta_k$ have been rejected. The exception is when $k = I$. To be at step 4 when $k = I$, all $\theta \in \Theta$ must have been rejected at some point in time. Clearly, the true parameter has been rejected as well, but this occurs with very small probability and the contribution to the regret in this case is asymptotically negligible.

4.3 Extension to infinite $\Theta$

Let $\theta \in \Theta_\ell$ be the true underlying parameter. When $\Theta$ is finite, consistency of $\hat\theta$ would imply that $\hat\theta = \theta$ with probability close to 1 when $N$ is large. Hence $B_\ell(\hat\theta)$ and $J(\hat\theta)$ would be good substitutes for the unknown $B_\ell(\theta)$ and $J(\theta)$ respectively. Complications arise when $\Theta$ is infinite. Firstly, it is possible that $B_\ell(\theta)$ is non-empty while $B_\ell(\theta')$ is empty for all $\theta'$ arbitrarily close to $\theta$. Secondly, by continuity of $\mu_{ij}$, it follows that there exists $\delta > 0$ such that

$$J(\theta') \subset J(\theta) \quad \text{for all } \theta' \in N_\delta(\theta) \cap \Theta_\ell,$$

but the preceding statement with $\subset$ replaced by $=$ is not necessarily true. Hence $B_\ell(\hat\theta)$ and $J(\hat\theta)$ are in general poor substitutes for $B_\ell(\theta)$ and $J(\theta)$ when $\Theta$ is infinite. Moreover, if $\theta$ lies on the boundary of $\Theta_\ell$, then $(\cup_{i > \ell} \Theta_i) \cap N_\delta(\theta)$ can be nonempty for all small $\delta > 0$. This implies that $z_{kj}(\hat\theta)$ may be inconsistent for $z_{kj}(\theta)$. This would not happen when $\Theta$ is finite.

Our strategy in extending the optimal procedure from finite $\Theta$ to infinite $\Theta$ is not to select $\hat\theta$ during the estimation phase but rather to select some appropriate adjusted estimate $\hat\theta_a \in N_{\delta/2}(\hat\theta)$, where $\delta \to 0$ as $N \to \infty$ at a rate that is specified in Theorem 2 below. We require firstly that

$$\hat\theta_a \in N_{\delta/2}(\hat\theta) \cap \Theta_{\hat\ell}, \quad \text{where } \hat\ell = \min\{i : \Theta_i \cap N_{\delta/2}(\hat\theta) \ne \emptyset\}. \tag{4.3}$$

This condition ensures that if $\theta$ lies on the boundary of $\Theta_\ell$, then the probability that $\hat\theta_a \in \Theta_\ell$ tends to 1 as $N \to \infty$. Our next condition would ensure that the probability that $J(\hat\theta_a) = J(\theta)$ tends to 1 as $N \to \infty$. Let $|\cdot|$ denote the number of elements in a finite set and

$$J = \max\{|J(\theta')| : \theta' \in N_{\delta/2}(\hat\theta) \cap \Theta_{\hat\ell}\}.$$

We require in addition to (4.3), that

$$\hat\theta_a \in H := \{\theta' \in N_{\delta/2}(\hat\theta) \cap \Theta_{\hat\ell} : |J(\theta')| = J\}, \quad \text{where } \hat\ell \text{ is defined in (4.3)}. \tag{4.4}$$

If $\Theta$ is finite, then for $\delta > 0$ small enough, $N_{\delta/2}(\hat\theta) = \{\hat\theta\}$ and hence by (4.3) and (4.4), $\hat\theta_a = \hat\theta$. Therefore the selection of $\hat\theta_a$ for infinite $\Theta$ is consistent with the selection procedure for finite $\Theta$ when $N$ is large. The final thing left to do is the estimation of $B_\ell(\theta)$. This can be done by taking a union of $B_{\hat\ell}(\theta')$ over $\theta' \in H$. We thus have the following modification of the optimal strategy for infinite $\Theta$, which reduces to the optimal strategy for finite $\Theta$ for $\delta > 0$ small enough.

Optimal strategy for infinite $\Theta$.

1'. Estimation. Let $k = 1$ and $\hat\theta_a$ be an adjusted MLE satisfying (4.3) and (4.4).

2'. Experimentation. Let $z_{kj}$ be the solution to Problem A with parameter $\hat\theta_a$ and with the bad set $B_{\hat\ell}(\hat\theta_a)$ replaced by $\cup_{\theta' \in H} B_{\hat\ell}(\theta')$.

(a)' If $\hat\theta_a \in \cup_{i > k} \Theta_i$: Take $\lfloor z_{kj} \log N \rfloor$ observations from job $kj$, $j = 1, \ldots, J_k$.

(b)' If $\hat\theta_a \in \Theta_k$: Take $\lfloor z_{kj} \log N \rfloor$ observations from job $kj$ for $j \notin J(\hat\theta_a)$.

(c)' If $\hat\theta_a \in \cup_{i < k} \Theta_i$: Skip the experimentation phase.

3'. and 4'. Identical to the strategy for finite $\Theta$, with $\hat\theta_a$ replacing $\hat\theta$.

In view of (4.3) and (4.4), the modified strategy $\phi^*$ described above will lead to
asymptotic efficiency for infinite $\Theta$, as stated in Theorem 2 below. It is also convenient,
when $\Theta$ is infinite, to decide on the rejection of a job in step 3 based on the current
sample rather than to keep track of which $\lambda$ have been rejected previously. Hence for
practical use, we can also make the following modification to the rejection of jobs in
step 3':

Let $U_{kj}(n) = \inf_{\lambda \in \Theta_{kj}} U_k(n; \lambda)$. Reject job $kj$ if $U_{kj}(n) > N$.

Theorem 2 Assume A1-A5. The error probabilities of the strategy $\phi^*$ from the
estimation stage satisfy the following properties. Let $n_0 \to \infty$ with $n_0 = o(\log N)$ and
$n_1 \to \infty$ such that $n_1 = o(n_0)$. Then there exists $\delta (= \delta_N) \downarrow 0$ such that

(4.5) $P_\theta\{\hat\theta_a \in \Theta \setminus N_\delta(\theta)\} = o(n_1^{-1})$ as $N \to \infty$.

Let $\theta \in \Theta_\ell$. Then the regret of $\phi^*$ due to overshoot in the testing stage is $O(1)$ because

(4.6) $\sum_{i>\ell} \sum_{j=1}^{J_i} E_\theta T_N(ij) \le 1$.

Therefore, the total regret satisfies

(4.7) $\lim_{N\to\infty} R_N(\theta)/\log N = z(\theta, \ell)$.

Remark 1. Theorem 2 extends Fuh and Hu (2000) to situations where more than
one job in each group is available for processing. Theorem 2 also generalizes the results
of Lai (1987) and Agrawal et al. (1989a,b) to the case of infinite state and parameter
spaces and more than one job group.

Remark 2. If there is a constant switching cost each time we switch from one job
to another, it can be shown that the strategy $\phi^*$ has switching cost of order $o(\log N)$.
Hence $\phi^*$ is still efficient when switching cost is taken into account. The details will
be given in another paper.

Remark 3. We consider a non-empty bad set in this paper. It can be shown that
the proposed strategy $\phi^*$ can achieve $o(\log N)$ regret when the bad set is empty and
$I = 1$. In general, within the optimal group, the contribution to the regret from jobs
optimal for parameter values outside the bad set is $o(\log N)$. The essence of the proof
of this fact is contained in Section 6. We will provide detailed justification in another
paper. The upshot is that the strategy $\phi^*$ can achieve super-efficient results outside
of bad sets.

5 Examples 

Example 3: Multi-phase project management. To illustrate how our method
can be applied, we discuss a few examples. Our purpose here is not to provide
an accurate statistical model for a particular situation, but rather to supply concrete 
examples of parameter spaces and probability distributions such that the assumptions 
in Section 2.3 are satisfied. 

Consider the management of N research and development (R&D) projects. When
a project is pursued with one unit of resource, the reward is a normal random variable
$X$ with mean $\mu_t(\theta)$ and variance $\sigma_t^2(\theta)$. Given the parameter value $\theta$, the mean $\mu_t(\theta)$
reflects, at time $t$, the level of existing technology and knowledge relevant to the
concerned projects as well as the competition in the market. Let $\theta = (\alpha, \beta)$ and

(5.1) $\mu_t(\theta) = f(t,\alpha)/h(t,\beta)$, $\sigma_t^2(\theta) = f(t,\alpha)/h^2(t,\beta)$,

where both $f(t, \alpha)$ (reflecting technology and knowledge) and $h(t, \beta)$ (reflecting
competition) are increasing functions of time $t$. Observe that under (5.1) the coefficient
of variation, $\sigma_t/\mu_t = 1/\sqrt{f(t,\alpha)}$, is a decreasing function of $t$, which can be interpreted
as follows. Because the products from the project will be gradually superseded by
more advanced ones through competition in the market, not only does the mean
reward become smaller, but we also become more certain of it as time moves on.
If we take $f$ and $h$ to be

(5.2) $f(t,\alpha) = \alpha t^2$ and $h(t, \beta) = e^{\beta t} - 1$,

then the maximal value of $\mu_t(\theta)$, for a fixed value of $\theta$, is attained uniquely at $t$ such
that $t\beta$ equals the constant 1.5936.
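The constant 1.5936 can be checked numerically. Writing $x = \beta t$ and differentiating $\log \mu_t(\theta) = \log\alpha + 2\log t - \log(e^{\beta t} - 1)$ gives the stationarity condition $2(e^x - 1) = x e^x$. The sketch below solves it by bisection; the function names and the parameter values $\alpha = 1$, $\beta = 2$ are illustrative assumptions, not part of the paper.

```python
import math


def stationarity_gap(x):
    """2(e^x - 1) - x e^x, which vanishes at the maximizer of t^2 / (e^{beta t} - 1)
    when x = beta * t."""
    return 2.0 * (math.exp(x) - 1.0) - x * math.exp(x)


def bisect(func, lo, hi, tol=1e-12):
    """Plain bisection; assumes func changes sign exactly once on [lo, hi]."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mean_reward(t, alpha, beta):
    """mu_t(theta) = alpha t^2 / (e^{beta t} - 1), from (5.1)-(5.2)."""
    return alpha * t ** 2 / (math.exp(beta * t) - 1.0)


root = bisect(stationarity_gap, 1.0, 2.0)  # the gap is positive at 1, negative at 2
alpha, beta = 1.0, 2.0                     # illustrative parameter values
t_star = root / beta                       # optimal pursuing time for this beta
```

For a given $\beta$, the optimal time is `root / beta`, so larger values of $\beta$ (stronger competition) favor earlier phases.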

Designate $I$ phases indexed by time points $0 < t_1 < \cdots < t_I$ during which
pursuing a project can take place. There are $J$ different types of projects that
can be pursued at any phase $i = 1, \dots, I$. To accommodate $I$ phases and $J$ types
of projects, we expand the parameter vector to $\theta = (\alpha_1, \dots, \alpha_J, \beta)$. Given (5.1) and
(5.2), let the rewards $X_{ijk}$ from pursuing the type $j$ project in phase $i$ with the $k$-th
unit of resource be i.i.d. normal with means and standard deviations

(5.3) $\mu_{ij}(\theta) = \alpha_j t_i^2/(e^{\beta t_i} - 1)$, $\sigma_{ij}(\theta) = \sqrt{\alpha_j}\, t_i/(e^{\beta t_i} - 1)$,

respectively.

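By selecting $\Theta = [\underline\alpha, \bar\alpha]^J \times [\underline\beta, \bar\beta]$ where $0 < \underline\alpha < \bar\alpha < \infty$ and $0 < \underline\beta < \bar\beta < \infty$,
condition A1 is easily seen to hold. Let $\theta' = (\alpha'_1, \dots, \alpha'_J, \beta')$. Since

$I_{ij}(\theta,\theta') = \frac{1}{2}\log\frac{\sigma_{ij}^2(\theta')}{\sigma_{ij}^2(\theta)} + \frac{\sigma_{ij}^2(\theta) - \sigma_{ij}^2(\theta') + [\mu_{ij}(\theta) - \mu_{ij}(\theta')]^2}{2\sigma_{ij}^2(\theta')}$

equals zero if and only if $\mu_{ij}(\theta) = \mu_{ij}(\theta')$ and $\sigma_{ij}^2(\theta) = \sigma_{ij}^2(\theta')$, or equivalently, $\alpha_j = \alpha'_j$
and $\beta = \beta'$, the information assumption A2 is also satisfied. It can be shown that
there exist $(\underline\beta =)\, \beta_I < \beta_{I-1} < \cdots < \beta_1 < \beta_0\, (= \bar\beta)$ such that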

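(5.4) $\Theta_i = \{\theta \in \Theta : \beta \in [\beta_1, \beta_0]\}$ for $i = 1$, and $\Theta_i = \{\theta \in \Theta : \beta \in [\beta_i, \beta_{i-1})\}$ for $2 \le i \le I$,

and

(5.5) $\Theta_{ij} = \{\theta \in \Theta_i : \alpha_j = \max_{1 \le k \le J} \alpha_k\}$.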

Since the observations $X_{ijk}$ are independent, the assumptions A3-A5 are satisfied
by selecting $G_{ij} = \mathbb{R}$, $V_{ij}(x) = |x| + 1$ and $b > \sup_{i,j,\theta\in\Theta} E_\theta|X_{ij1}| + 1$. Consequently,
the strategies described in Section 4 are efficient in the sense of attaining the regret
lower bound given by Theorem 1.
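The identifiability argument used for A2 in Example 3 can be illustrated numerically. The sketch below uses the standard Kullback-Leibler information between two normal distributions, and takes the phase-wise mean and standard deviation to be $\alpha t^2/(e^{\beta t}-1)$ and $\sqrt{\alpha}\,t/(e^{\beta t}-1)$, which is an assumption implied by (5.1)-(5.2) together with the stated coefficient of variation. Function names and parameter values are hypothetical.

```python
import math


def kl_normal(mu, sd, mu_alt, sd_alt):
    """Kullback-Leibler information of N(mu, sd^2) with respect to N(mu_alt, sd_alt^2)."""
    return (math.log(sd_alt / sd)
            + (sd ** 2 - sd_alt ** 2 + (mu - mu_alt) ** 2) / (2.0 * sd_alt ** 2))


def reward_moments(t, alpha, beta):
    """Mean and standard deviation of the reward at time t (assumed form)."""
    h = math.exp(beta * t) - 1.0
    return alpha * t ** 2 / h, math.sqrt(alpha) * t / h


base = reward_moments(1.0, 2.0, 1.5)
other_alpha = reward_moments(1.0, 3.0, 1.5)  # different alpha_j, same beta
other_beta = reward_moments(1.0, 2.0, 1.0)   # same alpha_j, different beta

info_same = kl_normal(*base, *base)
info_alpha = kl_normal(*base, *other_alpha)
info_beta = kl_normal(*base, *other_beta)
```

The information number vanishes only when both moments agree, which is the content of A2 for this example.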

Example 4: Multi-phase project management with Markovian reward.

Continuing from Example 3, instead of i.i.d. rewards, we assume that the reward from
the $k$-th pursuit of a project of type $j$ at time $t_i$ follows an AR(1) process

$X_{ijk} = a_i X_{ij(k-1)} + \varepsilon_{ijk}$,

where $|a_i| < 1$ and $\varepsilon_{ijk} \sim N(\mu_{ij}(\theta), \sigma_{ij}^2(\theta))$ with $\mu_{ij}$ and $\sigma_{ij}$ given by (5.3). Let
$G_{ij} = [-c, c]$ for some $c > 0$. Since $\varepsilon_{ij1}$ has a positive density on the real line, A3 is
satisfied. Let $V_{ij}(x) = |x| + 1$. From Meyn and Tweedie (1993), page 380, $\{X_{ijk}\}_{k \ge 0}$
is geometrically ergodic and A4 holds with $0 < \delta < 1 - \max_i |a_i|$ and $b, c$ large enough.

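The stationary distribution is normal with mean and variance given
by $(1 - a_i)^{-1}\mu_{ij}(\theta)$ and $(1 - a_i^2)^{-1}\sigma_{ij}^2(\theta)$. It can be checked that (5.4) and (5.5),
which reveal the structure of the parameter space, still hold for AR(1) rewards.
Consequently, A1 is true for AR(1) rewards. To simplify the presentation of the
Kullback-Leibler information number, we drop the indices $i, j$ and use $\mu', \sigma'$ to denote
$\mu(\theta'), \sigma(\theta')$, respectively, and $a'$ for the autoregressive coefficient under $\theta'$. Then

$I(\theta,\theta') = \frac{1}{2}\log\frac{\sigma'^2}{\sigma^2} + \frac{1}{2\sigma'^2}\Big[\sigma^2 - \sigma'^2 + (\mu - \mu')^2 + (a - a')^2\{\mu^2(1-a)^{-2} + \sigma^2(1-a^2)^{-1}\} + 2(a - a')(\mu - \mu')\mu(1-a)^{-1}\Big]$.

It is clear that the Kullback-Leibler number is greater than zero if $\theta \ne \theta'$. From the
preceding equation, we can verify that A2 and A5 hold.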
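The stationary moments and the information number for the Gaussian AR(1) model can be checked with a short sketch. It assumes the recursion $X_k = a X_{k-1} + \varepsilon_k$ with $\varepsilon_k \sim N(\mu, \sigma^2)$, iterates the exact mean and variance recursions, and evaluates the Kullback-Leibler number of the one-step transition averaged over the stationary law. All function names and numerical values are illustrative.

```python
import math


def ar1_moment_path(a, mu, sigma2, x0, steps):
    """Mean and variance of X_steps given X_0 = x0, via the exact recursions
    mean <- a * mean + mu and var <- a^2 * var + sigma2."""
    mean, var = x0, 0.0
    for _ in range(steps):
        mean = a * mean + mu
        var = a * a * var + sigma2
    return mean, var


def ar1_stationary(a, mu, sigma2):
    """Stationary mean mu / (1 - a) and variance sigma2 / (1 - a^2)."""
    return mu / (1.0 - a), sigma2 / (1.0 - a * a)


def ar1_kl(a, mu, sigma2, a_alt, mu_alt, sigma2_alt):
    """Kullback-Leibler number between the transition laws N(a x + mu, sigma2)
    and N(a_alt x + mu_alt, sigma2_alt), with x drawn from the stationary law
    of the first chain."""
    m, v = ar1_stationary(a, mu, sigma2)
    second_moment = v + m * m
    quad = ((mu - mu_alt) ** 2
            + (a - a_alt) ** 2 * second_moment
            + 2.0 * (a - a_alt) * (mu - mu_alt) * m)
    return (0.5 * math.log(sigma2_alt / sigma2)
            + (sigma2 - sigma2_alt + quad) / (2.0 * sigma2_alt))


limit_mean, limit_var = ar1_moment_path(0.5, 1.0, 3.0, x0=10.0, steps=200)
stat_mean, stat_var = ar1_stationary(0.5, 1.0, 3.0)
```

When the two autoregressive coefficients coincide, `ar1_kl` reduces to the normal information number of Example 3.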

6 Proof of asymptotic efficiency 

We shall demonstrate the asymptotic efficiency of $\phi^*$ by proving (4.5)-(4.7). A
change-of-measure argument is first used to prove (4.6). As the proofs of (4.5) and (4.7) are
too involved for one reading, we prove them in Section 6.1 for the restricted case of
finite $\Theta$ and extend the proofs to infinite $\Theta$ in Section 6.2.

Proof of (4.6). Let $P$ be the measure which generates $X_n := \{X_{ijt}\}$ for $j = 1, \dots, J_i$
and $i = 1, \dots, \ell$, $t = 1, \dots, n_{ij}$ in the following manner. First generate $\theta'$ randomly
from $F_\ell$. Using the strategy $\phi^*$ to select the jobs to be processed, generate $X_{ij0}$ from
$\nu_{ij}(\cdot\,; \theta')$ and $X_{ijt}$, $t \ge 1$, according to the transition density $p_{ij}(X_{ij(t-1)}, \cdot\,; \theta')$ when
at job $ij$. Let $\theta \in \Theta_{\ell j}$. Then



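Let $T = (T_N(11), \dots, T_N(\ell J_\ell))$ and $A = \{U_\ell(T; \theta) > N\}$. Then $P_\theta\{\sum_{i>\ell} T_N(i) > 0\}$
is bounded by

(6.1) $P_\theta(A) = E_P[U_\ell^{-1}(T; \theta)\mathbf{1}_A] \le N^{-1}$.

Hence (4.6) follows from (6.1) and the bound $\sum_{i>\ell} T_N(i) \le N$. □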

6.1 Finite parameter space

Let $\Theta = \{\theta_0, \dots, \theta_h\}$. Let $\theta_0 \in \Theta_{\ell j_0}$ be the true parameter value. For $1 \le q \le h$,
define

(6.2) $\xi_{ijt}(q) = \log[p_{ij}(X_{ij(t-1)}, X_{ijt}; \theta_0)/p_{ij}(X_{ij(t-1)}, X_{ijt}; \theta_q)]$.

Then $E_{\pi(\theta_0)}\xi_{ijt}(q) = I_{ij}(\theta_0, \theta_q)$. To get the essence of the strategy without being
overly involved in cumbersome notation, let us consider a specific case $\ell = 2$, $J_1 =
J_2 = 2$, $\theta_0 \in \Theta_{21}$ and $J(\theta_0) = \{1\}$.
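The log-likelihood ratio increments in (6.2) and the maximum likelihood choice over a finite parameter set can be sketched in the simplest setting. The sketch assumes i.i.d. normal observations (so the transition density does not depend on the previous state), which is a simplification of the Markovian model; the parameter values and function names are hypothetical.

```python
import math


def log_density_normal(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def llr_sum(samples, theta_0, theta_q):
    """Sum over t of xi_t(q) = log p(x_t; theta_0) - log p(x_t; theta_q)."""
    return sum(log_density_normal(x, *theta_0) - log_density_normal(x, *theta_q)
               for x in samples)


def mle_finite(samples, candidates):
    """Maximum likelihood estimate over a finite list of (mean, sd) candidates."""
    return max(candidates,
               key=lambda theta: sum(log_density_normal(x, *theta) for x in samples))


samples = [0.9, 1.1, 1.0, 1.2, 0.8]
theta_0, theta_1 = (1.0, 1.0), (3.0, 1.0)
ratio = llr_sum(samples, theta_0, theta_1)   # each increment equals 4 - 2x here
estimate = mle_finite(samples, [theta_0, theta_1])
```

The estimate differs from `theta_0` only if the summed increments against some alternative are non-positive, which is the event bounded in the proof of (4.5).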

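We first prove (4.5). Let us consider the inequality

$P_{\theta_0}\{\hat\theta \ne \theta_0\} = \sum_{q=1}^{h} P_{\theta_0}\{\hat\theta = \theta_q\} \le \sum_{q=1}^{h} P_{\theta_0}\Big\{\sum_{t=1}^{n_0}[\xi_{11t}(q) + \xi_{12t}(q)] \le 0\Big\}$.

By A5 and Chebyshev's inequality,

$P_{\theta_0}\Big\{\sum_{t=1}^{n_0}[\xi_{11t}(q) + \xi_{12t}(q)] \le 0\Big\} \le \frac{\mathrm{Var}_{\theta_0}\big(\sum_{t=1}^{n_0}[\xi_{11t}(q) + \xi_{12t}(q)]\big)}{\big(E_{\theta_0}\sum_{t=1}^{n_0}[\xi_{11t}(q) + \xi_{12t}(q)]\big)^2} = O(n_0^{-1}) = o(n_1^{-1})$,

since the variance is of order $n_0$ while the expectation is of order $n_0[I_{11}(\theta_0,\theta_q) + I_{12}(\theta_0,\theta_q)]$, and $n_1 = o(n_0)$.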

This completes the proof of (4.5) for the finite parameter case.
We now undertake the proof of (4.7). For $q \ge 1$, let $\theta_q \in \Theta_{kj'}$ where either (i)
$k < \ell$ or (ii) $k = \ell$ and $j' \notin J(\theta_0)$. Let $T_{kj}(q)$ be the number of observations selected
from job $kj$ in the testing phase of group $k$ before parameter $\theta_q$ is rejected. To show
(4.7), it suffices to prove that

(6.3) $E_{\theta_0}\big[\sum_{j=1}^{J_k} T_{kj}(q)\big] = o(\log N)$ if $k < \ell$ and $E_{\theta_0}\big[\sum_{j \notin J(\theta_0)} T_{\ell j}(q)\big] = o(\log N)$,

because (6.3) implies that the regret in the testing phase before leaving the optimal
group is $o(\log N)$. Together with the regret due to overshooting the optimal group, which
is also $o(\log N)$ by the established (4.6), this completes the justification.

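Select $C > 0$ large enough such that $\xi^C_{ijt}(q) = \xi_{ijt}(q) \wedge C$ has positive expectation
under $\pi(\theta_0)$ for all $i, j, q$ satisfying $I_{ij}(\theta_0, \theta_q) > 0$. Let $n = (n_{11}, n_{12})$. We will first
show that the first half of (6.3) is satisfied when $\theta_q \in \Theta_1$. By (4.2),

(6.4) $\log U_1(n; \theta_q) \ge \sum_{t=1}^{n_{11}} \xi_{11t}(q) + \sum_{t=1}^{n_{12}} \xi_{12t}(q) + \log v_{11} + \log v_{12} + \log F_1(\theta_0)$,

where $v_{ij} = \inf_{x,\theta,\lambda}[\nu_{ij}(x; \theta)/\nu_{ij}(x; \lambda)] > 0$ as assumed in (2.2). Hence by (4.2),
rejection of $\theta_q$ has occurred when

(6.5) $\sum_{t=1}^{n_{11}} \xi_{11t}(q) + \sum_{t=1}^{n_{12}} \xi_{12t}(q) \ge c := \log N - \log v_{11} - \log v_{12} - \log F_1(\theta_0)$.

Let $m = (m_{11}, m_{12}) = (n_0 + \lfloor z_{11}(\hat\theta) \log N \rfloor, n_0 + \lfloor z_{12}(\hat\theta) \log N \rfloor)$ be the sample size
at the beginning of the testing phase. Since $\xi^C_{ijt}(q)$ is bounded above by $C$, it follows
that at $n = (n'_{11}, n'_{12})$, at which the boundary is first crossed by the $\xi^C_{ijt}$'s,

(6.6) $\sum_{t=1}^{m_{11}} \xi_{11t}(q) + \sum_{t=1}^{m_{12}} \xi_{12t}(q) + \sum_{t=m_{11}+1}^{n'_{11}} \xi^C_{11t}(q) + \sum_{t=m_{12}+1}^{n'_{12}} \xi^C_{12t}(q) \le c + C$.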

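By (4.5), the condition $n_0 = o(\log N)$, (6.5), and the constraint $I_{11}(\theta_0, \theta_q) z_{11}(\theta_0) +
I_{12}(\theta_0, \theta_q) z_{12}(\theta_0) \ge 1$ from (3.3), it follows that

(6.7) $E_{\theta_0}\Big[\sum_{t=1}^{m_{11}} \xi_{11t}(q) + \sum_{t=1}^{m_{12}} \xi_{12t}(q)\Big] \ge (1 + o(1))c$.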

Subtracting (6.7) from the expectation of (6.6), we have

(6.8) $E_{\theta_0}\Big[\sum_{t=m_{11}+1}^{n'_{11}} \xi^C_{11t}(q) + \sum_{t=m_{12}+1}^{n'_{12}} \xi^C_{12t}(q)\Big] \le C + o(c) = o(c)$.

By Wald's equation for Markov processes, the left hand side of (6.8) equals

(6.9) $(1 + o(1))\big[E_{\pi(\theta_0)}\xi^C_{11t}(q)\, E_{\theta_0}(n'_{11} - m_{11}) + E_{\pi(\theta_0)}\xi^C_{12t}(q)\, E_{\theta_0}(n'_{12} - m_{12})\big]$.

The proof of Wald's equation for Markov processes is given in the Appendix. By A2
and the choice of $C$ for $\xi^C_{ijt}(q)$, $E_{\pi(\theta_0)}\xi^C_{ijt}(q) > 0$. Since the sample size in the testing
stage satisfies $T_{ij}(q) \le n'_{ij} - m_{ij}$, it follows from (6.8)-(6.9) that $E_{\theta_0}[T_{11}(q) + T_{12}(q)] = o(c)$
for all $\theta_q \in \Theta_1$. Hence the rejection of both $\Theta_{11}$ and $\Theta_{12}$ involves only $o(\log N)$
observations and the first half of (6.3) holds.
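The role of the truncation at $C$ can be seen in a small sketch: when every increment is capped at $C$, the running sum cannot exceed the threshold $c$ by more than $C$ at the first crossing, which is the overshoot bound used above. The helper names and the numerical inputs are hypothetical, and the increments are supplied as plain numbers rather than computed from a likelihood.

```python
import math


def rejection_threshold(n_total, v_values, prior_mass):
    """Threshold c = log N - sum(log v_ij) - log F, as in (6.5)."""
    return math.log(n_total) - sum(math.log(v) for v in v_values) - math.log(prior_mass)


def first_crossing(increments, threshold, cap):
    """Accumulate increments truncated above at `cap` and return (n, total) at the
    first n where the running total reaches `threshold`; (None, total) if never."""
    total = 0.0
    for n, xi in enumerate(increments, start=1):
        total += min(xi, cap)
        if total >= threshold:
            return n, total
    return None, total


# Illustrative increments; one of them (10.0) is much larger than the cap.
increments = [0.5, 3.0, -1.0, 10.0, 2.0]
crossing_time, crossing_value = first_crossing(increments, threshold=4.0, cap=2.0)
c = rejection_threshold(1000, [1.0, 1.0], 1.0)
```

Without the cap, the single increment 10.0 would overshoot the threshold by far more than $C$, and the subtraction leading to (6.8) would no longer give an $o(c)$ bound.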

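Next we show that the second half of (6.3) holds when $\theta_q \in \Theta_{22}$. We divide into two
cases, $\theta_q \in B_2(\theta_0)$ and $\theta_q \notin B_2(\theta_0)$. Consider the first case. By (2.20), $I_{22}(\theta_0, \theta_q) > 0$.
We then follow the arguments above using (4.5) and the last inequality of (3.3) to
show that $E_{\theta_0}T_{22}(q) = o(c)$.

The second scenario involves $\theta_q \notin B_2(\theta_0)$. The key observation is $I_{21}(\theta_0, \theta_q) > 0$
by (2.19). In other words, information is always collected and no additional regret is
incurred when we sample from job 21. Under unequal sampling,

(6.10) $E_{\theta_0}T_{22}(q) = E_{\theta_0}[T_{22}(q)\mathbf{1}_{\{J(\hat\theta)=\{1\}\}}] + E_{\theta_0}[T_{22}(q)\mathbf{1}_{\{J(\hat\theta)=\{1,2\}\}}] + E_{\theta_0}[T_{22}(q)\mathbf{1}_{\{J(\hat\theta)=\{2\}\}}]$
$= n_1^{-1}E_{\theta_0}[T_{21}(q)\mathbf{1}_{\{J(\hat\theta)=\{1\}\}}] + E_{\theta_0}[T_{21}(q)\mathbf{1}_{\{J(\hat\theta)=\{1,2\}\}}] + n_1E_{\theta_0}[T_{21}(q)\mathbf{1}_{\{J(\hat\theta)=\{2\}\}}]$.

Since $E_{\theta_0}[T_{21}(q)\mathbf{1}_{\{J(\hat\theta)=A\}}] \le (1 + o(1))\,c\,P_{\theta_0}\{J(\hat\theta) = A\}/I_{21}(\theta_0, \theta_q)$ for $A = \{1\}, \{2\}$
and $\{1, 2\}$ and as $n_1 \to \infty$, the first term on the right hand side of (6.10) is $o(c)$ while
(4.5) ensures the second term is $o(c)$. By (4.5), $n_1P_{\theta_0}\{J(\hat\theta) \ne \{1\}\} \le n_1P_{\theta_0}\{\hat\theta \ne
\theta_0\} = o(1)$ and thus the third term on the right hand side of (6.10) is $o(c)$. We can
conclude that $E_{\theta_0}T_{22}(q) = o(c)$, that is, the second half of (6.3) holds.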

6.2 Extension to infinite parameter space

We preface the extension with the following lemma. The proof of this lemma is given
in the Appendix in Section 7.4. We shall let $\bar{A}$ denote the closure of a set $A$.

Lemma 2 Let $\theta_0 \in \Theta_\ell$. Assume A1-A5 and let $n_0 \to \infty$, $n_1 = o(n_0)$.

(a) Let $\theta' \ne \theta_0$ and let $\hat\theta$ be the MLE (4.1). Then there exists $\delta' > 0$ small
enough such that

(6.11) $P_{\theta_0}\{\hat\theta \in N_{\delta'}(\theta')\} \to 0$ as $N \to \infty$.

(b) Let $\theta' \in \bar\Theta_k$ for some $k < \ell$ or $\theta' \in \cup_{j \notin J(\theta_0)}\bar\Theta_{\ell j}$. Let $\delta' > 0$ and let $T_{kj}$ ($T_{\ell j}$) be
the number of observations selected from job $kj$ ($\ell j$) in the testing phase of group $k$
($\ell$) before all parameters in the set $N_{\delta'}(\theta')$ are rejected. Then for $\delta' > 0$ small enough,

(6.12) $E_{\theta_0}\big(\sum_{j=1}^{J_k} T_{kj}\big) = o(\log N)$ if $k < \ell$ and $E_{\theta_0}\big(\sum_{j \notin J(\theta_0)} T_{\ell j}\big) = o(\log N)$.

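We now apply Lemma 2 to extend the proof of Theorem 2. By the compactness of
$\Theta \setminus N_{\delta/2}(\theta_0)$, $\delta > 0$, there exist a finite set $\{\theta_1, \dots, \theta_h\}$ and constants $\delta_q > 0$ such that
(6.11) holds for $\theta' = \theta_q$ and $\delta' = \delta_q$ for all $1 \le q \le h$ and $\Theta \setminus \{\theta_0\} \supset \cup_{q=1}^{h} N_{\delta_q}(\theta_q) \supset
\Theta \setminus N_{\delta/2}(\theta_0)$. Then by (6.11), $P_{\theta_0}\{\hat\theta \in \Theta \setminus N_{\delta/2}(\theta_0)\} \to 0$ as $N \to \infty$ and the result
(4.5) follows from (4.3) because $\|\hat\theta_a - \hat\theta\| < \delta/2$.

It remains to show that the number of observations taken from each non-optimal
job in the testing phase is $o(\log N)$. Consider $k < \ell$, $j = 1, \dots, J_k$, or $k = \ell$ with
$j \notin J(\theta_0)$. Since $\bar\Theta_{kj}$ is compact, there exist a finite set $\{\theta_1, \dots, \theta_h\}$ and constants
$\delta_q > 0$ such that (6.12) is satisfied for $\theta' = \theta_q$, $\delta' = \delta_q$ for all $1 \le q \le h$ and
$\cup_{q=1}^{h} N_{\delta_q}(\theta_q) \supset \bar\Theta_{kj}$, and hence by (6.12), the number of times job $kj$ is processed in
the testing phase is $o(\log N)$ as required.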



7 Appendix 
7.1 Proof of (2.5) 

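Let $X_{ijt}$ denote the $t$-th observation taken from arm $ij$. Then

(7.1) $\Big|W_N(\theta) - \sum_{i=1}^{I}\sum_{j=1}^{J_i} \mu_{ij}(\theta)E_\theta T_N(ij)\Big| \le \sum_{i=1}^{I}\sum_{j=1}^{J_i}\sum_{t=1}^{\infty} |E_\theta g(X_{ijt}) - \mu_{ij}(\theta)|$.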



For any signed measure $\lambda$ on $(D, \mathcal{D})$, let

(7.2) $\|\lambda\|_{V_{ij}} = \sup_{h : |h| \le V_{ij}} \Big|\int h(x)\lambda(dx)\Big|$.



It follows from Meyn and Tweedie (1993, p.367 and Theorem 16.0.1) that under A3
and the geometric drift condition (2.14),

(7.3) $\omega_{ij} := \sup_{x \in D,\, \theta \in \Theta} \sum_{t=1}^{\infty} \|P^{\theta}_{ijt}(x, \cdot) - \pi_{ij}(\cdot\,; \theta)\|_{V_{ij}}/V_{ij}(x) < \infty$,

where $P^{\theta}_{ijt}(x, \cdot)$ denotes the distribution of $X_{ijt}$ conditioned on $X_{ij0} = x$ and $\pi_{ij}(\cdot\,; \theta)$
denotes the stationary distribution of $X_{ijt}$ under parameter $\theta$. By (2.13), there exists
$\kappa > 0$ such that $\kappa|g(x)| \le V_{ij}(x)$ for all $x \in D$ and hence it follows from (7.2) and
(7.3) that

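(7.4) $\kappa\sum_{t=1}^{\infty} |E_{\theta,x}\, g(X_{ijt}) - \mu_{ij}(\theta)| \le \omega_{ij}V_{ij}(x)$,

where $E_{\theta,x}$ denotes expectation with respect to $P_\theta$ and initial distribution $X_{ij0} = x$.
In general, for any initial distribution $\nu_{ij}(\cdot\,; \theta)$, it follows from (2.15) and (7.4) that

$\sum_{t=1}^{\infty} |E_\theta g(X_{ijt}) - \mu_{ij}(\theta)| \le \int \sum_{t=1}^{\infty} |E_{\theta,x}\, g(X_{ijt}) - \mu_{ij}(\theta)|\,\nu_{ij}(x; \theta)Q(dx) < \infty$

uniformly over $\theta \in \Theta$ and hence (2.5) follows from (7.1).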



7.2 Proof of Lemma 1 

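To prove (3.6), it suffices to show that for every $\theta' \in \Theta_k$, $k < \ell$ and for $\delta > a > 0$,

(7.5) $\lim_{N\to\infty} P_\theta\Big\{\sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')T_N(ij) < (1-\delta)\log N\Big\} = 0$.

Because the strategy is uniformly good and $\theta' \in \Theta_k$, it follows from (2.7) that
$E_{\theta'}[N - \sum_{j \in J(\theta')} T_N(kj)] = o(N^a)$ for $a > 0$. By A2, $I_{kj}(\theta, \theta') > 0$ for all $j \in J(\theta')$
and hence $I_0 := \min_{j \in J(\theta')} I_{kj}(\theta, \theta') > 0$. It then follows from Chebyshev's inequality that

(7.6) $P_{\theta'}\Big\{\sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')T_N(ij) < (1-\delta)\log N\Big\}$
$\le P_{\theta'}\Big\{I_0\sum_{j \in J(\theta')} T_N(kj) < (1-\delta)\log N\Big\}$
$= P_{\theta'}\Big\{N - \sum_{j \in J(\theta')} T_N(kj) > N - (1-\delta)(\log N)/I_0\Big\}$
$= O(N^{-1})E_{\theta'}\Big[N - \sum_{j \in J(\theta')} T_N(kj)\Big] = o(N^{a-1})$.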

Let $n = (n_{11}, \dots, n_{kJ_k})$ and $T_N = (T_N(11), \dots, T_N(kJ_k))$. Let

$L_n = \sum_{i=1}^{k}\sum_{j=1}^{J_i}\Big\{\log[\nu_{ij}(X_{ij0}; \theta)/\nu_{ij}(X_{ij0}; \theta')] + \sum_{t=1}^{n_{ij}} \ell_{ij}(X_{ij(t-1)}, X_{ijt}; \theta, \theta')\Big\}$

be the log likelihood ratio of $\theta$ with respect to $\theta'$, and denote

$G_N = \Big\{\sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')T_N(ij) < (1-\delta)\log N$ and $L_{T_N} \le (1-a)\log N\Big\}$.

Then by (7.6), $P_{\theta'}(G_N) = o(N^{a-1})$. By Wald's likelihood ratio identity for Markov
chains,

$P_{\theta'}\{T_N = n, L_n \le (1-a)\log N\} = E_\theta[\exp(-L_n)\mathbf{1}_{\{T_N = n,\, L_n \le (1-a)\log N\}}]$
$\ge N^{a-1}P_\theta\{T_N = n, L_n \le (1-a)\log N\}$.

By summing the preceding inequality over all $n$, we have

(7.7) $P_\theta(G_N) \le N^{1-a}P_{\theta'}(G_N) = N^{1-a}o(N^{a-1}) = o(1)$.

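By A3 and the strong law of large numbers for Markov chains (cf. Theorem 17.0.1
of Meyn and Tweedie, 1993),

$L_n - \sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')n_{ij} = o\Big(\sum_{i=1}^{k}\sum_{j=1}^{J_i} n_{ij}\Big)$ $P_\theta$-a.s. as $\sum_{i=1}^{k}\sum_{j=1}^{J_i} n_{ij} \to \infty$.

Thus,

$\lim_{m\to\infty} \max_{n :\, \sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta,\theta')n_{ij} \le m} \Big|L_n - \sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')n_{ij}\Big|/m = 0$ a.s. under $P_\theta$.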



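Because $1 - a > 1 - \delta$, it then follows that as $N \to \infty$,

$P_\theta\Big\{L_n > (1-a)\log N$ for some $n$ such that $\sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')n_{ij} < (1-\delta)\log N\Big\} \to 0$.

Therefore, as $N \to \infty$,

$P_\theta\Big\{\sum_{i=1}^{k}\sum_{j=1}^{J_i} I_{ij}(\theta, \theta')T_N(ij) < (1-\delta)\log N$ and $L_{T_N} > (1-a)\log N\Big\} \to 0$.

This combined with (7.7) gives (7.5), from which (3.6) follows by letting $\delta \downarrow 0$.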

We now consider the case $\theta' \in B_\ell(\theta)$. By (2.20), $\min_{j \in J(\theta')} I_{\ell j}(\theta, \theta') > 0$. The
proof proceeds as before with $k = \ell$, which leads us to (3.6) with $k = \ell$. Since
$I_{\ell j}(\theta, \theta') = 0$ for all $j \in J(\theta)$ by (2.19), (3.7) follows.



7.3 Proof of Theorem 1 

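As mentioned after (3.5), $B_\ell(\theta) \ne \emptyset$, and so by A1, $\Lambda_\ell = \Theta_1 \times \cdots \times \Theta_{\ell-1} \times B_\ell(\theta)$ is
non-empty. For each $\lambda = (\lambda_1, \dots, \lambda_\ell) \in \Lambda_\ell$ and $\theta \in \Theta_\ell$, we define $z(\theta, \ell, \lambda)$ to be the
minimal value of (3.2) with (3.3) replaced by

(7.8) $\sum_{j=1}^{J_1} I_{1j}(\theta, \lambda_1)z_{1j}(\theta) \ge 1$, $\dots$, $\sum_{i<\ell}\sum_{j=1}^{J_i} I_{ij}(\theta, \lambda_\ell)z_{ij}(\theta) + \sum_{j \notin J(\theta)} I_{\ell j}(\theta, \lambda_\ell)z_{\ell j}(\theta) \ge 1$.

By Lemma 1, (7.8) is true for all $\lambda \in \Lambda_\ell$. Therefore, $\liminf_{N\to\infty} R_N(\theta)/\log N \ge
\sup_{\lambda \in \Lambda_\ell} z(\theta, \ell, \lambda)$ for all $\theta \in \Theta_\ell$. The proof is completed if we can show that

(7.9) $z(\theta, \ell) = \sup_{\lambda \in \Lambda_\ell} z(\theta, \ell, \lambda)$.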



If $Z = \{z_{ij}(\theta) : j = 1, \dots, J_i$ for $i < \ell$, and $j \notin J(\theta)$, $i = \ell\}$ satisfies (3.3), then $Z$ also
satisfies (7.8). Thus

(7.10) $z(\theta, \ell) \ge \sup_{\lambda \in \Lambda_\ell} z(\theta, \ell, \lambda)$.



Because the $I_{ij}(\theta, \theta')$ are continuous with respect to $\theta'$, the infimums in (3.3) are
attained for some $\lambda \in \bar\Lambda_\ell$, the closure of $\Lambda_\ell$. Choose a sequence $\lambda(n) = (\lambda_1(n), \dots,
\lambda_\ell(n)) \in \Lambda_\ell$ such that it converges to $\lambda = (\lambda_1, \dots, \lambda_\ell)$. Note that $\lambda$ depends on
some feasible $z$ satisfying (3.3).

Let $z_n = (z_{11}(n), \dots, z_{\ell J_\ell}(n))$ be the solution of (3.2) satisfying (7.8) with $\lambda =
\lambda(n)$. Set

$c_{ij}(n) = \max\{I_{ij}(\theta, \lambda_1(n))/I_{ij}(\theta, \lambda_1), \dots, I_{ij}(\theta, \lambda_\ell(n))/I_{ij}(\theta, \lambda_\ell)\}$.

By the continuity of $I_{ij}$, we have

(7.11) $\lim_{n\to\infty} c_{ij}(n) = 1$, for $1 \le i \le \ell$.

In view of $c_{ij}(n)z_{ij}(n)I_{ij}(\theta, \lambda_k) \ge z_{ij}(n)I_{ij}(\theta, \lambda_k(n))$ for $i, j$ in an appropriate
index set, we see that $\{c_{ij}(n)z_{ij}(n)\}$ satisfy (3.3). Hence,



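$\Big(\max_{1 \le i \le \ell,\, 1 \le j \le J_i} c_{ij}(n)\Big) z(\theta, \ell, \lambda(n)) \ge \sum_{i<\ell}\sum_{j=1}^{J_i} [\mu^*(\theta) - \mu_{ij}(\theta)]c_{ij}(n)z_{ij}(n) + \sum_{j \notin J(\theta)} [\mu^*(\theta) - \mu_{\ell j}(\theta)]c_{\ell j}(n)z_{\ell j}(n) \ge z(\theta, \ell)$.

By (7.11), we have $\sup_{\lambda \in \Lambda_\ell} z(\theta, \ell, \lambda) \ge z(\theta, \ell)$, which combined with (7.10) implies
(7.9).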
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7.4 Proof of Lemma 2 

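By (2.18), there exists $\delta' > 0$ such that

(7.12) $E_{\pi_{ij}(\theta_0)}\Big[\sup_{\lambda \in N_{\delta'}(\theta')} |\ell_{ij}(X_{ij0}, X_{ij1}; \theta', \lambda)|\Big] < \varepsilon$

for all $i, j$ and $\theta' \in \Theta$, with $\varepsilon > 0$ to be specified later. Let

(7.13) $\tilde\ell_{ijt} = \ell_{ij}(X_{ij(t-1)}, X_{ijt}; \theta_0, \theta') - \sup_{\lambda \in N_{\delta'}(\theta')} \ell_{ij}(X_{ij(t-1)}, X_{ijt}; \theta', \lambda)$.

Since $\eta := \sum_{j=1}^{J_1} I_{1j}(\theta_0, \theta') > 0$, we can select $\delta' > 0$ to satisfy (7.12) with $\varepsilon < \eta/J_1$.
Then by (7.12)-(7.13), it follows that

$n_0^{-1}E_{\pi(\theta_0)}\Big[\sum_{j=1}^{J_1}\sum_{t=1}^{n_0} \tilde\ell_{1jt}\Big] \ge \sum_{j=1}^{J_1} I_{1j}(\theta_0, \theta') - J_1\varepsilon = \eta - J_1\varepsilon > 0$.

By the Harris recurrence condition A3 and the law of large numbers, it follows that

$P(A) \to 1$ as $N \to \infty$, where $A = \Big\{\sum_{j=1}^{J_1}\sum_{t=1}^{n_0} \tilde\ell_{1jt} > 0\Big\}$.

On the event $A$, the likelihood at $\theta_0$ is larger than that at all $\lambda \in N_{\delta'}(\theta')$ and hence (6.11)
holds.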

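To prove (6.12), we extend (6.2) and define

(7.14) $\zeta_{ijt} = \inf_{\theta \in N_{\delta'}(\theta_0),\, \lambda \in N_{\delta'}(\theta')} \log\Big[\frac{p_{ij}(X_{ij(t-1)}, X_{ijt}; \theta)}{p_{ij}(X_{ij(t-1)}, X_{ijt}; \lambda)}\Big]$
$\ge \ell_{ij}(X_{ij(t-1)}, X_{ijt}; \theta_0, \theta') - \sup_{\theta \in N_{\delta'}(\theta_0)} |\ell_{ij}(X_{ij(t-1)}, X_{ijt}; \theta_0, \theta)| - \sup_{\lambda \in N_{\delta'}(\theta')} |\ell_{ij}(X_{ij(t-1)}, X_{ijt}; \theta', \lambda)|$.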

Let $\theta' \in \bar\Theta_{kj_0}$ for some $k < \ell$. By A2, we can select $0 < \varepsilon < I_{kj_0}(\theta_0, \theta')/2J_k$ and hence
by (7.12) and (7.14), we have $E_{\pi(\theta_0)}\big(\sum_{j=1}^{J_k} \zeta_{kjt}\big) \ge I_{kj_0}(\theta_0, \theta') - 2J_k\varepsilon > 0$.
By (4.2) and (7.14), it follows that

(7.15) $\inf_{\lambda \in N_{\delta'}(\theta')} \log U_k(n; \lambda) \ge \log F_k(N_{\delta'}(\theta_0)) + \sum_{i=1}^{k}\sum_{j=1}^{J_i} \log v_{ij} + \sum_{i=1}^{k}\sum_{j=1}^{J_i}\sum_{t=1}^{n_{ij}} \zeta_{ijt}$,

where $v_{ij} = \inf_{x,\theta,\lambda}[\nu_{ij}(x; \theta)/\nu_{ij}(x; \lambda)]$. By (7.15), $T_{kj} \le n_{kj} - m_{kj}$, where $n = (n_{ij})$
is the sample size needed for $\sum_{i=1}^{k}\sum_{j=1}^{J_i}\sum_{t=1}^{n_{ij}} \zeta_{ijt}$ to cross the threshold $c := \log N -
\sum_{i=1}^{k}\sum_{j=1}^{J_i} \log v_{ij} - \log F_k(N_{\delta'}(\theta_0))$ and $m = (m_{ij})$ is the sample size at the start of
the testing phase. Following arguments analogous to (6.4)-(6.9), we can prove the
first half of (6.12).

Next, let us consider $k = \ell$. Let $f(\theta) = \mu_{\ell j_0}(\theta) - \sup_{j \in J(\theta_0)} \mu_{\ell j}(\theta)$ for some
$j_0 \notin J(\theta_0)$. Then $f(\theta_0) < 0$. Conversely, $f(\theta') \ge 0$ for any $\theta' \in \Theta_{\ell j_0}$. By A1, $f$
is continuous with respect to $\theta$ and hence $\inf_{\theta' \in \Theta_{\ell j_0}} \|\theta_0 - \theta'\| > 0$. The proof of the
second half of (6.12) then follows from arguments similar to those in the last two
paragraphs of Section 6.1.

7.5 Extension of Wald's equation to Markovian rewards

As we will be focusing on a single job $ij$ and fixed parameters $\theta_0, \theta_q$ such that $\mu :=
I_{ij}(\theta_0, \theta_q) > 0$, we will drop some of the references to $i, j, \theta_0, \theta_q$ and $q$ in this subsection.
This applies also to the notations in assumptions A3-A5. Moreover, we shall use the
notation $E(\cdot)$ as a short form of $E_{\theta_0}(\cdot)$ and $E_x(\cdot)$ as a short form of $E_{\theta_0}(\cdot\,|X_0 = x)$. Let
$S_n = \xi_1 + \cdots + \xi_n$, where $\xi_k = \log[p_{ij}(X_{k-1}, X_k; \theta_0)/p_{ij}(X_{k-1}, X_k; \theta_q)]$ has stationary
mean $\mu$ under $P_{\theta_0}$, and let $\tau$ be a stopping-time. We shall establish Wald's equation

(7.16) $ES_\tau = [\mu + o(1)]E\tau$

for Markovian rewards.

By (2.12), we can augment the Markov additive process and create a split chain
containing an atom, so that increments in $S_n$ between visits to the atom are independent.
More specifically, we construct stopping-times $0 < \kappa(1) < \kappa(2) < \cdots$ using an
auxiliary randomization procedure such that

(7.17) $P\{X_{n+1} \in A, \kappa(i) = n+1 \,|\, X_n = x, \kappa(i) > n \ge \kappa(i-1)\} = a\varphi(A)$ if $x \in G$, and $= 0$ otherwise.

Then by Lemma 3.1 of Ney and Nummelin (1987),

(i) $\{\kappa(i+1) - \kappa(i) : i = 1, 2, \dots\}$ are i.i.d. random variables,

(ii) the random blocks $\{X_{\kappa(i)}, \dots, X_{\kappa(i+1)-1}\}$, $i = 1, 2, \dots$, are independent, and

(iii) $P\{X_{\kappa(i)} \in A \,|\, \mathcal{F}_{\kappa(i)-1}\} = \varphi(A)$, where $\mathcal{F}_n$ is the $\sigma$-field generated by $\{X_0, \dots, X_n\}$.

Define $\kappa = \kappa(1)$. By (ii)-(iii), $E_\varphi(S_\kappa - \kappa\mu) = 0$. We preface the proof of (7.16)
with the following preliminary lemmas, whose proofs are given in Chan, Fuh and Hu
(2005).

Lemma 3 Let $\gamma(x) = E_x(S_\kappa - \kappa\mu)$. Then $(S_n - n\mu) + \gamma(X_n)$ is a martingale
with respect to $\mathcal{F}_n$. Hence

(7.18) $ES_\tau = \mu(E\tau) - E[\gamma(X_\tau)] + E[\gamma(X_0)]$.

Lemma 4 Under A3-A5, 

|7(x)| < b-\v{x) + h+{y* + h)V*{a-^ + l)]{K + l + \^x\), 
where $a$ satisfies (2.12), $V^*$ is defined in (2.15) and $K$ is defined in (2.16).

Let $W_i = |\gamma(X_{\kappa(i)})| + \cdots + |\gamma(X_{\kappa(i+1)-1})|$, for $i \ge 1$. Then by A3-A5, Lemma
4 and its proof, and (i)-(iii), $W_1, W_2, \dots$ are i.i.d. with finite mean while by (2.15),
$W_0 := |\gamma(X_0)| + \cdots + |\gamma(X_{\kappa(1)-1})|$ also has finite mean.

Lemma 5 Let $M_n = \max_{1 \le k \le n} W_k$. Then for any stopping-time $\tau$, $E(M_\tau) = o(E\tau)$.

Proof of (7.16). By Lemma 5, $E|\gamma(X_\tau)| + E|\gamma(X_0)| = o(E\tau)$, and (7.16) follows
from (7.18). □
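The content of (7.16) can be illustrated in a deliberately simple special case. The sketch below assumes a Gaussian AR(1) chain $X_k = aX_{k-1} + \varepsilon_k$ with $E\varepsilon_k = \mu_\varepsilon$, takes the increments to be $\xi_k = X_k$ (not the log-likelihood ratios of the paper), and uses a deterministic stopping time $\tau = n$. In that case $E S_n$ can be computed exactly, and its difference from $n$ times the stationary mean stays bounded, so that $ES_n = [\mu + o(1)]n$. Function names and values are illustrative.

```python
def expected_partial_sum(a, mu_eps, x0, n):
    """Exact E[S_n | X_0 = x0] for S_n = X_1 + ... + X_n, where
    X_k = a * X_{k-1} + eps_k and E[eps_k] = mu_eps.

    Returns the expected sum together with the stationary mean mu_eps / (1 - a).
    """
    stationary_mean = mu_eps / (1.0 - a)
    total, mean = 0.0, x0
    for _ in range(n):
        mean = a * mean + mu_eps  # E[X_k | X_0 = x0]
        total += mean
    return total, stationary_mean


n = 1000
expected_sum, stat_mean = expected_partial_sum(a=0.5, mu_eps=1.0, x0=10.0, n=n)
# The correction term equals (x0 - stat_mean) * a * (1 - a**n) / (1 - a), here about 8.
correction = expected_sum - n * stat_mean
```

The bounded correction plays the role of the terms $E[\gamma(X_0)] - E[\gamma(X_\tau)]$ in (7.18); for random stopping times the paper controls these terms through Lemma 5 instead.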
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